Introduction
This set of notes comes originally from a workshop that introduces participants to latent variable modeling techniques and graphical models more generally. However, it has since grown into a broader set of notes on graphical and latent variable modeling, much of which may not be covered in the workshop in detail.
How these models are presented when one first encounters them often depends on one's discipline. The following will take a broad view initially, but then focus on what is usually referred to as structural equation modeling in the social and educational science literature. An attempt will be made to use a consistent, rather than discipline-specific, nomenclature and approach.
One of the goals of this set of notes is to not instill a false sense of comfort and/or familiarity with the techniques. No one is going to be an expert after a couple of afternoons with SEM. SEM and other techniques covered are typically taught over the course of a few weeks in a traditional statistics course, or given their own course outright. Instead, one of the primary goals here is to instill a firm conceptual foundation starting with common approaches (e.g. standard regression), while exposing the participant to a wide family of related techniques, any of which might be useful to one’s modeling and data situation.
Prerequisites for the workshop
Statistical
One should at least have a firm understanding of standard regression modeling techniques. If you are new to statistical analysis in general, I’ll be blunt and say you are probably not ready for SEM. SEM employs knowledge of maximum likelihood, multivariate analysis, measurement error, indirect effects etc., and none of this is typically covered in a first semester of statistics in many applied disciplines. SEM builds upon that basic regression foundation, and if that is not solid, SEM will probably be confusing at best.
Programming
SEM requires its own modeling language approach. As such, the syntax for Mplus and other SEM-specific programs, as well as for SEM models within other languages or programs (e.g. R or Stata), is going to require you to learn something new.
Outline
Graphical Models
The workshop will start with the familiar, a standard regression model. It will then be presented as a graphical model, and extended to include indirect effects (e.g. \(\mathcal{A} \rightarrow \mathcal{B} \rightarrow \mathcal{C}\)). At this point we will discuss directed graphs, and demonstrate a more theoretically motivated approach (sometimes called path analysis), and compare it to a more flexible approach (Bayesian network) that does not require prespecification of paths nor specific directional relations. At this point, we’ll briefly discuss undirected graphs, with an example utilizing ‘network analysis’, though any exercise will likely be left to the participant’s own time.
Latent Variables
We will then discuss the notion of latent variables in the context of underlying causes or constructs and an understanding of the notion of measurement error. We will also note that latent variable models are actually even more broadly utilized when one includes other dimension reduction, or data compression, techniques, several of which fall under the heading of factor analysis. A few common techniques will be demonstrated such as ‘factor analysis’ and principal components analysis, and an overview will be provided for others. In addition, we will briefly discuss the relation of latent variable models to random effects models (something that will be demonstrated in more detail later), and note other places one might find latent variables (e.g. the EM algorithm, hidden Markov models).
SEM
The bulk of the rest of the workshop will focus on structural equation modeling. We will spend a good deal of time with measurement models first, comparing them to our previous efforts, and then extend those models to the case of regression with latent variables. There are many issues to consider when developing such models, and an attempt will be made to cover quite a bit of ground in that regard.
Others
Time may permit discussion of other topics within the SEM realm. For example, latent growth curve models are an alternative to a standard mixed model. Also, there are situations where the latent variable might be considered categorical, commonly called mixture models or cluster analysis, though in some specific contexts these go by other names (e.g. latent class analysis). Finally, an overview of other types of latent variable or structural equation models, such as item response theory, collaborative filtering, etc., may also be provided.
Programming Language Choice

We will use R for this workshop for a variety of reasons. One is that all of the techniques mentioned thus far are fully developed within various R packages, often taking just a line or two of code to implement after the data has been prepped. Furthermore, it is freely available and will work on Windows, Mac, and Linux. R is well known as a powerful statistical modeling environment, and its flexible programming language allows for efficient data manipulation and exploration. It also has a vast array of visualization capabilities. In short, it provides everything one might need in a single environment.
Among alternatives, Mplus is the most fully developed structural equation modeling package, and has been for years. However, it is a poor tool for data management, the University of Michigan does not have a campus-wide license for it, and most of its functionality (and all we will need for this workshop) is implemented within the lavaan family of R packages. Stata has recently added SEM capabilities, but they are less well developed (something that might change in time), and it still requires a license, making non-campus usage difficult or costly. SPSS Amos would perhaps be another alternative for some, but it suffers the same licensing issues, has historically lagged far behind Mplus in capabilities and flexibility, and is not supported on Unix-based operating systems such as Mac and Linux. Other alternatives include Proc Calis in SAS and OpenMx (another R package). Python implementation seems minimal at present. Historically, people used EQS and LISREL, but the former is no longer developed and the latter is no longer widely used, as well as being a Windows-only application.
Setup for the workshop
For the following to go smoothly, you’ll need to complete the following steps precisely.
If you will not be using a lab machine, or are bringing your own laptop, you'll need to have both R and Rstudio installed on whatever machine you'll be using. If this will be a new experience for you, install R first, then Rstudio, choosing the version appropriate to your operating system in each case. As you go through the installation, just accept all defaults when prompted until the installation process begins. Once both are installed, you will only need to work with Rstudio, and it will be assumed at all times that you are using Rstudio during the workshop.
Once those are installed, proceed through the following steps.
Download this zipfile, and unzip its contents to an area on your machine that you have write access to. It contains the course contents, data, etc. in a folder that will serve as an Rstudio project folder.
Open Rstudio. File/Open Project/ then navigate to the folder contents you just unzipped. Click on the SEM file (should look like a blue icon, but otherwise is the SEM.Rproj file).
If the file is not already opened after opening the Rstudio project, File/Open File/my_code.R . Run the one line of code there and you’re set.
The lab for this workshop has Windows machines, and so the above is enough to proceed. For *nix systems, it’s probably easiest to just install the packages as we use them.
Color coding in text
- object or class of object
- function
- package
- important term
Graphical Modeling
A graphical model can be seen as a mathematical or statistical construct connecting nodes via edges. When pertaining to statistical models, the nodes might represent variables of interest in our data set, and edges specify the relationships among them. Visually they are depicted in the style of the following examples.

Any statistical model you’ve ever run can be expressed as a graphical model. As an example, the first graph with nodes X, Y, and Z might represent a regression model in which X and Z predict Y. The emoticon graph shows an indirect effect, and the 123 graph might represent a correlation matrix.
A key idea of a graphical model is that of conditional independence, something one should keep in mind when constructing their models. The concept can be demonstrated with the following graph.

In this graph, X is conditionally independent of Z given Y: there is no correlation between X and Z once Y is accounted for. We will revisit this concept when discussing path analysis and latent variable models.
Graphs can be directed, undirected, or mixed. Directed graphs have arrows, sometimes implying a causal flow (a difficult endeavor to demonstrate explicitly) or noting a time component. Undirected graphs merely denote relations among the nodes, while mixed graphs might contain both directional and symmetric relationships. Most of the models discussed in this course will be directed or mixed.
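As a small illustration of conditional independence, the following sketch simulates one such structure (a chain in which X affects Z only through Y) and shows that the X-Z correlation essentially vanishes once Y is accounted for.
# simulate a simple X -> Y -> Z chain
set.seed(123)
n = 1000
X = rnorm(n)
Y = .7*X + rnorm(n)
Z = .7*Y + rnorm(n)
cor(X, Z)                                   # sizable marginal correlation
cor(resid(lm(X ~ Y)), resid(lm(Z ~ Y)))     # near zero once Y is accounted for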
Directed Graphs
As noted previously, we can represent standard models as graphical models. In most of these cases we’d be dealing with directed or mixed graphs. Almost always we are specifically dealing with directed acyclic graphs, where there are no feedback loops.
Standard linear model
Let’s start with the standard linear model (SLiM), i.e. a basic regression we might estimate via ordinary least squares (but not necessarily). In this setting, we want to examine the effect of each potential predictor (x* in the graph) on the target variable (y). The following shows what the graphical model might look like.

We start with the SLiM as a way to keep us thinking about the more complex models we will see later in the same way we do other models we might be more familiar with. In what follows, we’ll show that whether we use a standard R modeling approach (via the lm function), or an SEM approach (via the sem function in lavaan), the results are identical (aside from the fact that sem is using maximum likelihood).
mcclelland = haven::read_dta('data/path_analysis_data.dta')
lmModel = lm(math21 ~ male + math7 + read7 + momed, data=mcclelland)
Here we can do the same model using the lavaan package, and while the input form will change a bit, and the output will be presented in a manner consistent with sem, the estimated parameters are identical. Note that the residual standard error in lm is the square root of the variance estimate in the lavaan output.
library(lavaan)
model = "
math21 ~ male + math7 + read7 + momed
"
semModel = sem(model, data=mcclelland, meanstructure = TRUE)
summary(lmModel)
Call:
lm(formula = math21 ~ male + math7 + read7 + momed, data = mcclelland)
Residuals:
Min 1Q Median 3Q Max
-6.9801 -1.2571 0.1376 1.4544 5.7471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.84310 1.16418 4.160 4.08e-05 ***
male 1.20609 0.25831 4.669 4.44e-06 ***
math7 0.31306 0.04749 6.592 1.76e-10 ***
read7 0.08176 0.01638 4.991 9.81e-07 ***
momed -0.01684 0.06651 -0.253 0.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.329 on 324 degrees of freedom
(101 observations deleted due to missingness)
Multiple R-squared: 0.258, Adjusted R-squared: 0.2488
F-statistic: 28.16 on 4 and 324 DF, p-value: < 2.2e-16
lavaan (0.5-20) converged normally after 21 iterations
Used Total
Number of observations 329 430
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
Parameter Estimates:
Information Expected
Standard Errors Standard
Regressions:
Estimate Std.Err Z-value P(>|z|)
math21 ~
male 1.206 0.256 4.705 0.000
math7 0.313 0.047 6.642 0.000
read7 0.082 0.016 5.030 0.000
momed -0.017 0.066 -0.255 0.799
Intercepts:
Estimate Std.Err Z-value P(>|z|)
math21 4.843 1.155 4.192 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
math21 5.341 0.416 12.826 0.000
R-Square:
Estimate
math21 0.258
As we will see in more detail later, SEM incorporates more complicated regression models, but at this point it has the exact same interpretation as our standard regression. As we go along, we can see the models as generalizations of those we are already well acquainted with, and so one can use that prior knowledge as a basis for understanding the newer content.
Path Analysis
Path Analysis, and thus SEM, while new to some, is in fact a very, very old technique, statistically speaking. It can be seen as a generalization of the SLiM approach that allows for indirect effects and multiple target variables. Path analysis also has a long history in the econometrics literature, though under different names (e.g. instrumental variable regression, two-stage least squares), and in the computer science realm through the use of graphical models more generally. As such, there are many tools at your disposal for examining such models, and I’ll reiterate that much of the SEM perspective on modeling comes largely from specific disciplines, while other approaches may be better for your situation.
Types of relationships
The types of potential relationships examined by path analysis can be seen below. 
Aside: Tracing rule
In a recursive model, implied correlations between two variables, X1 and X2, can be found using tracing rules. Implied correlations between variables in a model are equal to the sum of the products of all standardized coefficients for the paths between them. Valid tracings are all routes between X1 and X2 that (a) do not enter the same variable twice, and (b) do not enter a variable through an arrowhead and leave through an arrowhead. The following examples assume the variables have been standardized (variances equal to 1); if standardization has not occurred, the variances of the variables passed through should be included in the product of tracings.
Consider the following variables, A, B, and C (in a data frame called abc), with a model seen in the diagram below. We are interested in identifying the implied correlation between A and C by decomposing the relationship into its different components and using tracing rules.
A B C
A 1.0 0.2 0.3
B 0.2 1.0 0.7
C 0.3 0.7 1.0
model = "
C ~ A + B
B ~ A
"
pathMod = sem(model, data=abc)
coef(pathMod)
C~A C~B B~A C~~C B~~B
0.167 0.667 0.200 0.478 0.950

To reproduce the correlation between A and C (sometimes referred to as a ‘total effect’), we sum the direct path and the product of the paths through B (verified in code below):
- Corr = (A→C path) + (A→B path) * (B→C path)
- Corr = 0.167 + 0.200 * 0.667
- Corr = 0.3
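We can check this directly with the estimated coefficients from the pathMod object above:
est = coef(pathMod)
# direct A -> C path plus the A -> B -> C tracing
unname(est['C~A'] + est['B~A'] * est['C~B'])   # ~ 0.3, the observed A-C correlation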
In SEM models, it’s important to consider how well our model-implied correlations correspond to the actual observed correlations. For over-identified models, the correlations will not be reproduced exactly, and thus can serve as a measure of how well our model fits the data. More on this later.
Multiple Targets
While relatively seldom used, multivariate linear regression is actually very straightforward in some programming environments such as R. Using the McClelland data, let’s try it for ourselves. First, let’s look at the data to get a sense of things.
vars n mean sd median trimmed mad min max range skew kurtosis se
attention4 1 430 17.93 3.05 18 18.03 2.97 9 25 16 -0.31 -0.17 0.15
math21 2 364 11.21 2.69 11 11.30 2.97 3 17 14 -0.26 -0.18 0.14
college 3 286 0.37 0.48 0 0.34 0.00 0 1 1 0.52 -1.74 0.03
vocab4 4 386 10.18 2.53 10 10.23 2.97 4 17 13 -0.18 -0.37 0.13
math7 5 397 10.73 2.76 11 10.69 2.97 4 19 15 0.13 -0.47 0.14
read7 6 390 31.57 8.05 30 30.87 8.90 18 61 43 0.79 0.41 0.41
read21 7 360 73.67 8.52 76 74.69 7.41 35 84 49 -1.60 3.88 0.45
adopted 8 430 0.49 0.50 0 0.48 0.00 0 1 1 0.06 -2.00 0.02
male 9 430 0.55 0.50 1 0.56 0.00 0 1 1 -0.19 -1.97 0.02
momed 10 419 14.83 2.03 15 14.77 1.48 10 21 11 0.11 -0.45 0.10
college_missing 11 430 0.33 0.47 0 0.29 0.00 0 1 1 0.70 -1.52 0.02
While these are not the most strongly correlated variables to begin with, one plausible model might try to predict math and reading at age 21 with measures taken at prior years. The coefficients in the output with ~1 are the intercepts.
model = "
read21 ~ attention4 + vocab4 + read7
math21 ~ attention4 + math7
read21 ~~ 0*math21
"
mvregModel = sem(model, data=mcclelland, missing='listwise', meanstructure = T)
coef(mvregModel)
read21~attention4 read21~vocab4 read21~read7 math21~attention4 math21~math7 read21~~read21
0.128 0.377 0.537 0.091 0.347 51.837
math21~~math21 read21~1 math21~1
5.965 50.290 5.856
The last line of the model code clarifies that we are treating math21 and read21 as independent. We can compare this to standard R regression. A first step is taken to make the data equal to what was used in lavaan. For that we can use the dplyr package to select the necessary variables for the model, and then omit rows that have any missing.
library(dplyr)
mcclellandComplete = select(mcclelland, read21, math21, attention4, vocab4, read7, math7) %>%
na.omit
lm(read21 ~ attention4 + vocab4 + read7, data=mcclellandComplete)
Call:
lm(formula = read21 ~ attention4 + vocab4 + read7, data = mcclellandComplete)
Coefficients:
(Intercept) attention4 vocab4 read7
50.2904 0.1275 0.3770 0.5368
lm(math21 ~ attention4 + math7, data=mcclellandComplete)
Call:
lm(formula = math21 ~ attention4 + math7, data = mcclellandComplete)
Coefficients:
(Intercept) attention4 math7
5.85633 0.09103 0.34732
However, we can and probably should estimate the covariance of math and reading skill at age 21. Let’s rerun the path analysis removing that constraint.
model = "
read21 ~ attention4 + vocab4 + read7
math21 ~ attention4 + math7
"
mvregModel = sem(model, data=mcclelland, missing='listwise', meanstructure = T)
coef(mvregModel)
read21~attention4 read21~vocab4 read21~read7 math21~attention4 math21~math7 read21~~read21
0.140 0.388 0.494 0.092 0.330 51.958
math21~~math21 read21~~math21 read21~1 math21~1
5.968 3.202 51.316 6.020
We can see that the coefficients are now slightly different from the SLiM approach. The read21~~math21 value represents the residual covariance between math and reading at age 21, i.e. after accounting for the other covariate relationships modeled, it tells us how correlated those skills are. Using summary will show it to be statistically significant (not shown here).
Whether or not to take a multivariate/path-analytic approach vs. separate regressions is left to the researcher. Assuming multivariate normality is perhaps less tenable than using the assumption for a single variable, and it’s easy to conduct univariate models. But as the above shows, it doesn’t take much to take into account correlated target variables.
Indirect Effects
So path analysis allows for multiple target variables, with the same or a mix of covariates for each target. What about indirect effects? Normal regression models examine direct effects only, and the regression coefficients reflect that direct effect. However, perhaps we think a particular covariate causes some change in another variable, which then causes some change in the target variable. This is especially plausible when measures are collected at different time points. Note that in SEM, any variable that has an arrow pointing to it in the graphical depiction is often called an endogenous variable, while those that only have arrows going out from them are exogenous. Exogenous variables may still have (unanalyzed) correlations among them. As we will see later, both observed and latent variables may be endogenous or exogenous.
Consider the following model.
Here we posit attention span and vocabulary at age 4 as indicative of what to expect for reading skill at age 7, and that is ultimately seen as a precursor to adult reading ability. In this model, attention span and vocabulary at 4 only have an indirect effect on adult reading ability through earlier reading skill. At least temporally it makes sense, so let’s code this up.
model = "
read21 ~ read7
read7 ~ attention4 + vocab4
"
mediationModel = sem(model, data=mcclelland)
summary(mediationModel, rsquare=TRUE)
lavaan (0.5-20) converged normally after 21 iterations
Used Total
Number of observations 305 430
Estimator ML
Minimum Function Test Statistic 6.513
Degrees of freedom 2
P-value (Chi-square) 0.039
Parameter Estimates:
Information Expected
Standard Errors Standard
Regressions:
Estimate Std.Err Z-value P(>|z|)
read21 ~
read7 0.559 0.050 11.152 0.000
read7 ~
attention4 0.270 0.151 1.791 0.073
vocab4 0.488 0.186 2.629 0.009
Variances:
Estimate Std.Err Z-value P(>|z|)
read21 53.047 4.296 12.349 0.000
read7 66.779 5.408 12.349 0.000
R-Square:
Estimate
read21 0.290
read7 0.035
What does this tell us? As before, we interpret the results as we would any other regression model, though conceptually there are two sets of models to consider (even though they are estimated simultaneously): one for reading at age 7 and one for reading at age 21. And indeed, one can think of path analysis as a series of linked regression models. Here we have positive relationships of attention and vocab with reading at age 7, and a positive effect of reading at age 7 on reading at age 21. Statistically speaking, our model appears viable, with statistically significant estimates (or nearly so) for each path.
However, look at the R2 value for reading at age 7. We now see that there are actually no practical effects of the age 4 variables at all, as they account for less than 4% of the variance; all we have really discovered is that prior reading ability affects later reading ability.
We can test the indirect effect itself by labeling the paths. In the following code, I label them based on the first letters of the variables involved (e.g. vr refers to the vocab to reading path), but note that these are arbitrary names. I also add the direct effects of the early-age variables. While the indirect effect for vocab is statistically significant, we already know there is not a strong correlation between vocab at age 4 and reading at age 21; the effect is largely driven by the strong relationship between reading at age 7 and reading at age 21, which is probably not all that interesting. A comparison of AIC values, something we’ll talk more about later, would favor a model with only direct effects.
model = "
read21 ~ rr*read7 + attention4 + vocab4
read7 ~ ar*attention4 + vr*vocab4
# Indirect effects
att4_read21 := ar*rr
vocab4_read21 := vr*rr
"
mediationModel = sem(model, data=mcclelland)
summary(mediationModel, rsquare=TRUE, fit=T, std=T)
lavaan (0.5-20) converged normally after 28 iterations
Used Total
Number of observations 305 430
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Model test baseline model:
Minimum Function Test Statistic 121.544
Degrees of freedom 5
P-value 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -3602.058
Loglikelihood unrestricted model (H1) -3602.058
Number of free parameters 7
Akaike (AIC) 7218.115
Bayesian (BIC) 7244.157
Sample-size adjusted Bayesian (BIC) 7221.957
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent Confidence Interval 0.000 0.000
P-value RMSEA <= 0.05 1.000
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Information Expected
Standard Errors Standard
Regressions:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
read21 ~
read7 (rr) 0.536 0.050 10.607 0.000 0.536 0.515
attentin4 0.134 0.134 0.998 0.318 0.134 0.015
vocab4 0.381 0.166 2.299 0.021 0.381 0.044
read7 ~
attentin4 (ar) 0.270 0.151 1.791 0.073 0.270 0.033
vocab4 (vr) 0.488 0.186 2.629 0.009 0.488 0.059
Variances:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
read21 51.926 4.205 12.349 0.000 51.926 0.695
read7 66.779 5.408 12.349 0.000 66.779 0.965
R-Square:
Estimate
read21 0.305
read7 0.035
Defined Parameters:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
att4_read21 0.145 0.082 1.766 0.077 0.145 0.017
vocab4_read21 0.261 0.102 2.552 0.011 0.261 0.030
In the original article, I did not find their description or diagrams of the models detailed enough to know precisely what the model was in the actual study, but here is at least one interpretation if you’d like to examine it further.
modReading = "
read21 ~ read7 + attention4 + vocab4 + male + adopted + momed
read7 ~ attention4 + vocab4 + male + adopted + momed
"
reading = sem(modReading, data=mcclelland, missing='fiml', mimic = 'Mplus', std.ov=TRUE)
summary(reading, rsquare=TRUE)
A note about terminology: some refer to models with indirect effects as mediation models, and that terminology appears commonly in the SEM (and esp. psychology) literature (along with the notion of ‘buffering’). Many applied researchers starting out with SEM often confuse the term with moderation, which is called an interaction in every other modeling context. As you start out, referring to indirect effects and interactions will likely keep you clear on what you’re modeling, and perhaps be clearer to those who may not be as familiar with SEM. Also, in moderation models, one will often see some variable denoted as ‘the moderator’, but this is completely arbitrary. In an interaction, it makes just as much sense to say that the A-Y relationship varies as a function of B as it does the B-Y relationship varies as a function of A.
Caveats about indirect effects
One should think very hard about positing an indirect effect, especially if there is no time component or obvious causal path. If the effect isn’t immediately obvious, then one should probably just examine the direct effects. Unlike other standard explorations one might do with models (e.g. look for nonlinear relationships), the underlying causal connection is more explicit in this context. Many models I’ve seen in consulting struck me as arbitrary as far as which covariate served as the mediator, required a fairly convoluted explanation for its justification, or ignored other relevant variables because the reasoning would have to include a plethora of indirect effects if it were to be logically consistent. Furthermore, I can often ask one or two questions and will discover that researchers are actually interested in interactions (i.e. moderation), rather than indirect effects.
This document will not get into models with moderated mediation or mediated moderation. In my experience these are often associated with models that are difficult to interpret at best, or are otherwise not grounded very strongly in substantive concerns. However, there are times, e.g. in experimental settings (to which surprisingly few SEMs are applied), where they would be entirely appropriate. It is left to the reader to investigate those particular complications when the time comes.
Bayesian Networks
In many cases of path analysis, the path model is not strongly supported by prior research or intuition, and people are also often willing to use modification indices after the fact to change the paths in their model. This is unfortunate, as their model is generally overfit to begin with, and more so if altered in such an ad hoc fashion.
A more exploratory approach to graphical modeling is available however. Bayesian Networks are an alternative to graphical modeling of the sort we’ve been doing. Though they can be used to produce exactly the same results that we obtain with path analysis via maximum likelihood estimation, they can also be used for constrained or wholly exploratory endeavors as well, with regularization in place to keep from overfitting.
As an example, I use the McClelland data to explore potential paths via the bnlearn package. I impose the constraints that variables later in time do not affect variables earlier in time, that no variables are directed toward background characteristics like sex, and, at least for these purposes, that math and reading at a particular time point do not have paths to each other. Some of the so-called blacklist of constraints is shown below.
from to
1 read21 read7
2 read21 math7
3 read21 attention4
4 read21 vocab4
5 read21 adopted
6 read21 male
library(bnlearn)
# structure learning via the Grow-Shrink (gs) algorithm, with a shrinkage estimator
# of the Gaussian mutual information as the conditional independence test
model = gs(mcclellandNoCollege, blacklist = blacklist, test='mi-g-sh')
# plot(model)
The plot of the model results shows that attention span at age 4 has no useful relationship to the other variables, something we’d already suspected based on previous models, and even could guess at the outset given its low correlations. As it has no connections, I’ve dropped it from the visualization. Furthermore, the remaining paths make conceptual sense. The parameters, fitted values, and residuals can be extracted with the bn.fit function, and other diagnostic plots, cross-validation and prediction on new data are also available.
We won’t get into the details of these models except to say that one should have them in their tool box. And if one really is in a more exploratory situation, the tools available would typically come with methods far better suited for it than the SEM software approach. The discovery process with Bayesian networks can also be a lot of fun. Even if one has strong theory, nature is always more clever than we are, and you might find something interesting.
Undirected Graphs
So far we have been discussing directed graphs in which the implied causal flow tends toward one direction and there are no feedback loops. However, sometimes the goal is not so much to estimate the paths as it is to find the structure. Undirected graphs simply specify the relations of nodes with edges, but without any directed arrows regarding the relationship.
While we could have used the bnlearn package for an undirected graph by adding the argument undirected = T, there are a slew of techniques available for what is often called network analysis. Often the focus is on observations, rather than variables, and what influences whether one sees a tie or not, with modeling techniques available for predicting ties (e.g. Exponential Random Graph models). Often these are undirected graphs and that is our focus here, but they do not have to be.
Network analysis
Networks can be seen everywhere. Personal relationships, machines and devices, various business and academic units… we can analyze the connections among any number of things. A starting point for a very common form of network analysis is an adjacency matrix, which represents connections among items we wish to analyze. Often it is just binary 0-1 values where 1 represents a connection. Any similarity matrix could potentially be used (e.g. a correlation matrix). Here is a simple example of an adjacency matrix:
|            | Bernadette | David | Josh | Lagia | Mancel | Nancy |
|------------|------------|-------|------|-------|--------|-------|
| Bernadette | 1          | 0     | 1    | 1     | 1      | 1     |
| David      | 0          | 1     | 1    | 0     | 0      | 1     |
| Josh       | 1          | 1     | 1    | 0     | 1      | 0     |
| Lagia      | 1          | 0     | 0    | 1     | 0      | 1     |
| Mancel     | 1          | 0     | 1    | 0     | 1      | 0     |
| Nancy      | 1          | 1     | 0    | 1     | 0      | 1     |
Visually, we can see the connections among the nodes.
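If you would like to produce such a plot yourself, here is a minimal sketch using the igraph package; it assumes the adjacency matrix above is stored in a matrix called adj (a hypothetical name) with the people's names as row and column names.
library(igraph)
g = graph_from_adjacency_matrix(adj, mode='undirected', diag=FALSE)  # drop self-ties
plot(g)   # vertex labels are taken from the matrix names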
As an example of a network analysis, let’s look at how states might be more or less similar on a few variables. We’ll use the state.x77 data set in base R; it is readily available, no need for loading. To practice your R skills, use the function str on the state.x77 object to examine its structure, head to see the first 6 rows, and ? to find out more about it.
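In code, that quick exploration looks like the following:
str(state.x77)     # a matrix: 50 states by 8 variables
head(state.x77)    # first 6 rows
# ?state.x77       # more about the variables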
Here are the correlations of the variables.
The following depicts a graph of the states based on the variables Life Expectancy, Median Income, High School Graduation Rate, and Illiteracy. The colors represent the results of a community detection algorithm, and serve to show that the clustering is not merely geographical, though one cluster is clearly geographically oriented (‘the South’).
Understanding Networks
Networks can be investigated in an exploratory fashion or lead to more serious modeling approaches. The following is a brief list of common statistics or modeling techniques.
Centrality
- Degree: how many links a node has (can also be ‘indegree’ or ‘outdegree’ for directed graphs)
- Closeness: how close a node is to other nodes
- Betweenness: how often a node lies on the shortest path between two other nodes
- PageRank: From Google, a measure of node ‘importance’
- Hub: a measure of the value of a node’s links
- Authority: another measure of node importance
Characteristics of the network as a whole may also be examined, e.g. degree distribution, ‘clusteriness’, average path length etc.
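As a rough sketch of how these might be computed, the igraph package provides functions for most of them; this assumes a graph object g such as the one from the small friend network sketch above.
degree(g)                  # number of links per node
closeness(g)               # closeness centrality
betweenness(g)             # betweenness centrality
page_rank(g)$vector        # PageRank
hub_score(g)$vector        # hub scores
authority_score(g)$vector  # authority scores
edge_density(g)            # network-level: density of ties
mean_distance(g)           # average path length
transitivity(g)            # clustering coefficient ('clusteriness')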
Cohesion
Investigate how network members create communities and cliques. This is similar to cluster analysis used in other situations. Some nodes may be isolated.
Modeling
- ERGM: exponential random graph models, regression modeling for network data
- Other link analysis
Comparison
A goal might be to compare multiple networks to see if they differ in significant ways.
Dynamics
While many networks are ‘static’, many others change over time. One might be interested in this structural change by itself, or modeling something like link loss.
Nonrecursive Models
Recursive models have only unidirectional causal effects and uncorrelated disturbances. A model is considered nonrecursive if there is a reciprocal relationship, feedback loop, or correlated disturbance in the model. Nonrecursive models are potentially problematic when there is not enough information to estimate the model (i.e. when the model is unidentified).
A classic example of a nonrecursive relationship is marital satisfaction: the more satisfied one partner is, the more satisfied the other, and vice versa. This can be represented by a simple model (below).

Such models are notoriously difficult to specify in terms of identification, which we will talk more about later. For now, we can simply say the above model would not even be estimated as there are more parameters to estimate (two paths, two variances) than there is information in the data (two variances and one covariance).
To make this model identified, one approach is to use what are called instrumental variables. Instrumental variables directly influence one of the variables in a reciprocal relationship, but not the other. For example, a wife’s education can influence her satisfaction directly and a husband’s education can influence his satisfaction directly, but a husband’s education cannot directly impact a wife’s satisfaction, and vice versa (at least for this demonstration). These instrumental variables can indirectly impact a spouse’s satisfaction (below). The dashed line represents an unanalyzed correlation.

Many instances of nonrecursive models might better be represented by a correlation. One must have a very strong theoretical motivation for such models, which is probably why they aren’t seen as often in the SEM literature, though they are actually quite common in some areas such as economics.
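As a sketch of how the instrumental-variable version above might be specified in lavaan (the variable names here are invented for illustration):
nonrecursiveModel = "
  WifeSat    ~ HusbandSat + WifeEdu
  HusbandSat ~ WifeSat + HusbandEdu
  WifeEdu   ~~ HusbandEdu      # unanalyzed correlation (the dashed line)
"
# nonrecursiveFit = sem(nonrecursiveModel, data=maritalData)   # maritalData is hypothetical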
Summary
Path analysis in SEM is a form of theoretically motivated graphical model involving only observed variables. Such models might include indirect effects and multiple outcomes of interest. However, path analysis is a special case of a broader set of graphical modeling tools widely used in many disciplines, any of which might be useful for a particular data situation.
R packages used
- lavaan
- bnlearn
- network
- mediation (not used but one to note)
Latent Variables
Not everything we want to measure comes with an obvious yardstick. If one wants to measure something like a person’s happiness, what would they have at their disposal?
- Are they smiling?
- Did they just get a pay raise?
- Are they interacting well with others?
- Are they relatively healthy?
Any of these might be useful as an indicator of their current state of happiness, but of course none of them would tell us whether they truly are happy or not. At best, they can be considered imperfect measures. If we consider those and other indicators collectively, perhaps we can get an underlying measure of something we might call happiness, contentment, or some other arbitrary but descriptive name.
Despite how they are typically used within psychology, education, and related fields, latent variable models are actually seen all over, and in ways that may have little to do with what we will be focusing on in this course. Broadly speaking, factor analysis can be seen as a dimension reduction technique, or as an approach to modeling measurement error and understanding underlying constructs. We will give some description of the former while focusing on the latter.
Dimension Reduction/Compression
Many times we simply have the goal of taking a whole lot of variables, reducing them to much fewer, but while retaining as much information about the originals as possible. For example, this is an extremely common goal in areas of image and audio compression. Statistical techniques amenable to these approaches are commonly referred to as matrix factorization.
Principal Components Analysis
Probably the most commonly used factor-analytic technique is principal components analysis (PCA). It seeks to extract components from a set of variables, with each component containing as much of the original variance as possible. Components can be seen as a linear combination of the original variables.
PCA works on a covariance/correlation matrix, and it will return as many components as there are variables that go into it, each subsequent component accounting for less variance than the previous component, and summing up to 100% of the total variance in the original data. With appropriate steps, the components can completely reproduce the original correlation matrix. However, as the goal is dimension reduction, we only want to retain some of these components, and so the reproduced matrix will not be exact. This however gives us some sense of a goal to shoot for, and the same idea is also employed in factor analysis/SEM, where we also work with the covariance matrix and prefer models that can closely reproduce the original matrix seen with the observed data.
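As a small demonstration of this ‘reproduce the correlation matrix’ idea, the following sketch uses base R’s eigen decomposition (any correlation matrix will do; mtcars is used simply because it is built in). Retaining all components reproduces the matrix exactly, while retaining only a few gives an approximation.
R = cor(mtcars[, 1:5])
e = eigen(R)
L = e$vectors %*% diag(sqrt(e$values))   # component loadings
round(L %*% t(L) - R, 10)                # all components: reproduces R exactly
L[, 1:2] %*% t(L[, 1:2])                 # first two components: an approximation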
Visually, we can display PCA as a graphical model. Here is one with four components/variables. The size of the components represents the amount of variance each accounts for.
Let’s see an example. The following regards a correlation matrix of 24 psychological tests given to 145 seventh- and eighth-grade children in a Chicago suburb by Holzinger and Swineford. We’ll use the psych package and its principal function, providing the data (the Harman74.cor correlation matrix), specifying the number of components/factors we want to retain, and setting other options (in this case, the rotated solution will be a little more interpretable). The psych package is used here because it gives a little more output than standard PCA functions, and output that is more consistent with the factor analysis technique we’ll spend time with later. While we will use lavaan for factor analysis to be consistent with the SEM approach, the psych package is a great tool for standard factor analysis, assessing reliability, and other fun stuff.
library(psych)
pc = principal(Harman74.cor$cov, nfactors=4, rotate='varimax')
pc
Principal Components Analysis
Call: principal(r = Harman74.cor$cov, nfactors = 4, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC3 RC2 RC4 h2 u2 com
VisualPerception 0.16 0.71 0.23 0.14 0.60 0.40 1.4
Cubes 0.09 0.59 0.08 0.03 0.37 0.63 1.1
PaperFormBoard 0.14 0.66 -0.04 0.11 0.47 0.53 1.2
Flags 0.25 0.62 0.09 0.03 0.45 0.55 1.4
GeneralInformation 0.79 0.15 0.22 0.11 0.70 0.30 1.3
PargraphComprehension 0.81 0.18 0.07 0.21 0.73 0.27 1.2
SentenceCompletion 0.85 0.15 0.15 0.06 0.77 0.23 1.1
WordClassification 0.64 0.31 0.24 0.11 0.57 0.43 1.8
WordMeaning 0.84 0.16 0.06 0.19 0.78 0.22 1.2
Addition 0.18 -0.13 0.83 0.12 0.76 0.24 1.2
Code 0.18 0.05 0.63 0.37 0.57 0.43 1.8
CountingDots 0.02 0.17 0.80 0.05 0.67 0.33 1.1
StraightCurvedCapitals 0.18 0.41 0.62 0.03 0.59 0.41 1.9
WordRecognition 0.23 -0.01 0.06 0.68 0.52 0.48 1.2
NumberRecognition 0.12 0.08 0.05 0.67 0.48 0.52 1.1
FigureRecognition 0.06 0.46 0.05 0.58 0.55 0.45 1.9
ObjectNumber 0.14 0.01 0.24 0.68 0.54 0.46 1.4
NumberFigure -0.02 0.32 0.40 0.50 0.51 0.49 2.7
FigureWord 0.14 0.25 0.20 0.42 0.30 0.70 2.4
Deduction 0.43 0.43 0.09 0.30 0.47 0.53 2.8
NumericalPuzzles 0.18 0.42 0.50 0.17 0.49 0.51 2.5
ProblemReasoning 0.42 0.41 0.13 0.29 0.45 0.55 3.0
SeriesCompletion 0.42 0.52 0.25 0.20 0.55 0.45 2.7
ArithmeticProblems 0.40 0.14 0.55 0.26 0.55 0.45 2.5
RC1 RC3 RC2 RC4
SS loadings 4.16 3.31 3.22 2.74
Proportion Var 0.17 0.14 0.13 0.11
Cumulative Var 0.17 0.31 0.45 0.56
Proportion Explained 0.31 0.25 0.24 0.20
Cumulative Proportion 0.31 0.56 0.80 1.00
Mean item complexity = 1.7
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.06
Fit based upon off diagonal values = 0.97
First focus on the last portion of the output where it says SS loadings . The first line is the sum of the squared loadings for each component (in this case where we are using a correlation matrix, summing across all 24 components would equal the value of 24). The Proportion Var tells us how much of the overall variance the component accounts for out of all the variables (e.g. 4.16 / 24 = 0.17). The Cumulative Var tells us that all 4 components make up over half the variance. The others are the same thing just based on the four components rather than all 24 variables. We can see that each component accounts for a decreasing amount of variance.
Loadings in this scenario represent the estimated correlation of an item with its component, and provide the key way in which we interpret the factors. However, we have a lot of them, and rather than interpret that mess in our output, we’ll look at it visually. In the following plot, stronger loadings are indicated by blue, and we can see the different variables associated with different components.
Interpretation is the fun but commonly difficult part. As an example, we can see PC2 as indicative of mathematical ability, but in general this isn’t the cleanest result.
Some explanation of the other parts of the output:
- h2: the amount of variance in the item explained by the (retained) components. It is the sum of the squared loadings, a.k.a. the communality (see the quick check below).
- u2: 1 - h2, the uniqueness
- com: A measure of complexity. A value of 1 would be seen for an item that loads on only one component and has zero loadings on the others (a.k.a. perfect simple structure).
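Both h2 and u2 can be recovered directly from the loadings of the pc object above:
Lmat = unclass(pc$loadings)   # 24 x 4 matrix of loadings
h2 = rowSums(Lmat^2)          # communality for each item
u2 = 1 - h2                   # uniqueness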
We can get a quick graphical model display as follows:
fa.diagram(pc, cex=.5)
PCA is probably not the best choice in this scenario, nor, likely, is a 4-factor solution. One of the primary reasons is that this graphical model assumes the observed variables are measured without error. In addition, the principal components do not correlate with one another, but it seems likely that we would want to allow the latent variables to do so (a different rotation would allow this). However, if our goal is merely to reduce the 24 items to a few components that account for the most variance, this would be a standard technique.
Other Matrix Factorization Techniques
Before leaving PCA for factor analysis of the sort we’ll mostly be concerned with, I’ll mention other matrix factorization techniques that might be of use depending on your data situation.
- SVD: singular value decomposition. Works on a raw data matrix rather than covariance matrix, and is still a very viable technique that may perform better in a lot of situations relative to fancier latent variable models, and other more recently developed techniques. Variations on SVD are behind some recommender systems of the sort you come across at Amazon, Netflix etc.
- ICA: Independent components analysis. Extracts non-normal, independent components. The primary goal is to create independent latent variables.
- Generalized PCA: PCA techniques for different types of data, e.g. binary data situations.
- PC Regression: combining PCA with regression in one model.
- NMF: non-negative matrix factorization. Applied to positive valued matrices, produces positive valued factors. Useful, for example, when dealing with counts.
- LSI: Latent Semantic Indexing, an early form of topic modeling.
- Many others.
Factor Analysis
Factor analysis is a general technique for uncovering latent variables within data. While initially one might think it similar to PCA, one difference from PCA is that the goal is not to recover maximum variance with each component. However, we will move beyond factor analysis as a dimension reduction technique (and fully ‘exploratory’ technique, see below), and instead present it as an approach with a strong theoretical underpinning, and one that can help us assess measurement error, ultimately even leading to regression models utilizing the latent variables themselves.
So let us turn to what are typically called measurement models within SEM. The underlying model can be thought of as a case in which the observed variables, in some disciplines referred to as indicators of the latent construct (also manifest variables), are caused by the latent variable. The degree to which the observed variables correlate with one another depends in part on how much of the underlying (presumed) latent variable they reliably measure.
For each indicator we can think of a regression model as follows, where \(\beta_0\) is the intercept and \(\lambda\) the regression coefficient that expresses the effect of the latent variable \(F\) on the observed variable \(X\) (I do not note the error).
\[X = \beta_0 + \lambda F\]
We will almost always have multiple indicators, and often multiple latent variables. Some indicators may be associated with multiple factors.
\[\begin{aligned}
X_1 &= \beta_{01} + \lambda_{11} F_1 + \lambda_{21} F_2 \\
X_2 &= \beta_{02} + \lambda_{12} F_1 + \lambda_{22} F_2 \\
X_3 &= \beta_{03} + \lambda_{13} F_1
\end{aligned}\]
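For reference, the corresponding lavaan-style specification would look something like the following, using the generic X and F names from the equations (in lavaan, =~ is read ‘is measured by’):
twoFactorModel = "
  F1 =~ X1 + X2 + X3   # F1 loads on all three indicators
  F2 =~ X1 + X2        # F2 loads on the first two; its loading for X3 is fixed to 0
"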
It is important to understand this regression model, because many who engage in factor analysis seemingly do not, and often think of it the other way around, where the observed variables cause the latent variable. In factor analysis, the \(\lambda\) coefficients are called loadings (as they were in PCA), but are interpreted as any other regression coefficient: a one-unit change in the latent variable results in a \(\lambda\) change in the observed variable. Most factor models assume that, controlling for the latent variable, the observed variables are independent (recall our previous discussion of conditional independence in graphical models), though this is sometimes relaxed.
Exploratory vs. Confirmatory
An unfortunate and unhelpful distinction in some disciplines is that of exploratory vs. confirmatory factor analysis (and even exploratory SEM). In any regression analysis, there is a non-zero correlation between any variable and some target variable. We don’t include everything for theoretical (and even practical) reasons, which is akin to fixing its coefficient to zero, and here it is no different. Furthermore, most modeling endeavors could be considered exploratory, regardless of how the model is specified. As such, this distinction doesn’t tell us anything about the model, and is thus unnecessary in my opinion.
As an example, \(X_3\) is not modeled by \(F_2\), which is the same as fixing the \(\lambda_{23}\) coefficient for \(F_2\) to 0, but that doesn’t tell me whether the model is exploratory or not, and yet that is all the distinction refers to, namely, whether we let all factors load on all indicators or not. An analysis doesn’t suddenly have more theoretical weight, validity, etc. due to the paths specified.
Example
Let’s see a factor analysis in action. The motivating example for this section comes from the National Longitudinal Survey of Youth (1997, NLSY97), which investigates the transition from youth to adulthood. For this example, we will investigate a series of questions asked to the participants in 2006 pertaining to the government’s role in promoting well-being. Questions regarded the government’s responsibility for the following: providing jobs for everyone, keeping prices under control, providing health care, providing for elderly, helping industry, providing for unemployed, reducing income differences, providing college financial aid, providing decent housing, protecting the environment. Each item has four values 1:4, which range from ‘definitely should be’ to ‘definitely should not be’. We’ll save this for the exercise.
There are also three items regarding their emotional well-being (depression)- how often the person felt down or blue, how often they’ve been a happy person, and how often they’ve been depressed in the last month. These are also four point scales and range from ‘all of the time’ to ‘none of the time’. We’ll use this here.
depressed = read.csv('data/nlsy97_depressedNumeric.csv')
library(lavaan)
modelCode = "
depressed =~ FeltDown + BeenHappy + DepressedLastMonth
"
famod = cfa(modelCode, data=depressed)
summary(famod, standardized=T)
lavaan (0.5-20) converged normally after 19 iterations
Used Total
Number of observations 7183 8985
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
depressed =~
FeltDown 1.000 0.541 0.813
BeenHappy -0.732 0.020 -37.329 0.000 -0.396 -0.609
DeprssdLstMnth 0.719 0.019 37.992 0.000 0.388 0.655
Variances:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
FeltDown 0.150 0.007 20.853 0.000 0.150 0.339
BeenHappy 0.266 0.006 46.489 0.000 0.266 0.629
DeprssdLstMnth 0.201 0.005 41.606 0.000 0.201 0.571
depressed 0.292 0.010 30.221 0.000 1.000 1.000
Raw results
In a standard measurement model such as this we must scale the factor by fixing one of the indicator’s loadings to one. This is done for identification purposes, so that we can estimate the latent variable variance. Which variable is selected for scaling is arbitrary, but doing so means that the sum of the latent variable variance and the residual variance of the variable whose loading is fixed to one equals the variance of that observed variable.

var(depressed$FeltDown, na.rm=T) # .29 + .15
[1] 0.441856
Standardized latent variable
An alternative way to scale the latent variable is to simply fix its variance to one (the std.lv=TRUE results). It does not need to be estimated, allowing us to obtain loadings for each observed variable.
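In lavaan this is a one-argument change to the earlier call:
famodStd = cfa(modelCode, data=depressed, std.lv=TRUE)
summary(famodStd)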
Standardized latent and observed
With both standardized (using the summary function, set standardized=T), these loadings represent correlations between the observed and latent variables. This is the default output in the factor analysis we’d get from non-SEM software (i.e. ‘exploratory’ FA). If one is just doing a factor-analytic model, these loadings are typically reported. Standardized coefficients in a CFA are computed by taking the unstandardized coefficient (loading), multiplying it by the latent variable’s standard deviation, and dividing by the model-implied standard deviation of the indicator. Otherwise, one can simply use standardized variables in the analysis, or supply only the correlation matrix.
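As a quick hand-check using the raw results above for FeltDown:
lambda  = 1.000                  # unstandardized (fixed) loading
sd_lat  = sqrt(0.292)            # latent variable standard deviation
sd_item = sqrt(0.292 + 0.150)    # model-implied standard deviation of FeltDown
lambda * sd_lat / sd_item        # ~ 0.81, the Std.all value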

If you’d like to peel back the curtain and see maximum likelihood estimation based on the raw (covariance) and standardized (correlation) scales, with a comparison to lavaan output, click here.
Constructs and Measurement models
Scale development
A good use of factor analysis regards scale development. If we come up with 10 items that reflect some underlying construct, factor analysis can provide a sense of how well the scale is put together. Recall that in path analysis, residual variance, sometimes called disturbance, reflects both omitted causes as well as measurement error. In this context, \(1-R^2_{item}\) provides a sense of how unreliable the item is. A perfectly reliable item would have a value of zero. Strong loadings indicate a strong relationship between the item and the underlying construct.
Factor Scores
In factor analysis, we can obtain estimated factor scores for each observation, possibly to be used in additional analyses or examined in their own right. One common way is to simply use the loadings as one would regression weights/coefficients (actually scaled versions of them), and create a ‘predicted’ factor score as the linear combination of the indicator variables, just as we would in regression.
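In lavaan, estimated factor scores are available via lavPredict; for the depression model above:
fscores = lavPredict(famod)   # one estimated factor score per observation
head(fscores)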
vs. Means/Sums
On many occasions, people reduce the number of variables in a model by using a mean or sum score. These can actually be seen to reflect an underlying factor model in which all loadings are fixed to be equal and the residual variances of the observed variables are fixed to zero, i.e. perfect measurement. If you really think the items reflect a particular construct, you’d probably be better off using a score that comes from a model that doesn’t assume perfect measurement.
vs. Composites
Composite scores are what we’d have if we turned the arrows around and allowed different weights for the different variables, which may not be too similar in nature or even necessarily correlated (e.g. think of how one might construct a measure of socioeconomic status). Unlike a simple mean, these would have different weights associated with the items. PCA is one way one could create such a composite.

Some Other Uses of Latent Variables
EM algorithm: A very common technique to estimate model parameters for a variety of model situations, it incorporates a latent variable approach where parameters of interest are treated as a latent variable (e.g. probability of belonging to some cluster).
Item Response Theory: uses latent variables, especially in test situations (though is much broader), to assess things like item discrimination, student ability etc.
Hidden Markov Model: A latent variable model approach commonly used for time series.
Topic Model: In the analysis of text, one can discover latent ‘topics’ based on the frequency of words.
Collaborative Filtering: For example, in recommender systems for movies or music, the latent variable might represent genre.
Summary
Latent variable approaches are a necessary tool to have in your statistical toolbox. Whether your goal is to compress data or explore underlying constructs, ‘factor-analysis’ will serve you well.
Structural Equation Modeling
Structural equation modeling combines the path analytic and latent variable techniques together to allow for regression models among latent variables. Any model, even the SLiM, can be seen as some form of SEM. However, the term is typically reserved for the combination of latent and observed variables in a model.
Measurement model
The measurement model refers to the latent variable models, i.e. factor analysis, and typical practice in SEM is to investigate these separately and first. The reason is that one wants to make sure the measurement model holds before going any further with the underlying constructs. For example, for one’s sample of data one might find that two latent variables work better for a set of indicators, or that some indicators are performing poorly.
Structural model
The structural model specifies the relations among latent and observed variables that do not serve as indicators. It can become quite complex, but at this stage one can lean on what they were exposed to with path analysis, as conceptually we’re in the same place, except now some variables may be latent.
Example
The following model is a classic example from Wheaton et al. (1977), which used longitudinal data to develop a model of the stability of alienation from 1967 to 1971, accounting for socioeconomic status as a covariate. Each of the three factors has two indicator variables: SES in 1966 is measured by education and occupational status in 1966, and alienation in both years is measured by powerlessness and anomia. The structural component of the model hypothesizes that SES in 1966 influences both alienation in 1967 and 1971, and that alienation in 1967 influences the same measure in 1971. We also let the disturbances correlate from one time point to the next.
wheaton.model <- '
# measurement model
ses =~ education + sei
alien67 =~ anomia67 + powerless67
alien71 =~ anomia71 + powerless71
# structural model
alien71 ~ aa*alien67 + ses
alien67 ~ sa*ses
# correlated residuals
anomia67 ~~ anomia71
powerless67 ~~ powerless71
# Indirect effect
IndirectEffect := sa*aa
'

The standardized results of the structural model are shown below, and the full model results below that. In this case, the structural paths are statistically significant, as is the indirect effect specifically. Higher socioeconomic status is associated with less alienation, while there is a notable positive relationship of prior alienation with later alienation. We are also accounting for roughly 50% of the variance in 1971 alienation. Colors represent positive vs. negative weights, and the closer to zero the more faded they are.

lavaan (0.5-20) converged normally after 73 iterations
Number of observations 932
Estimator ML
Minimum Function Test Statistic 4.735
Degrees of freedom 4
P-value (Chi-square) 0.316
Model test baseline model:
Minimum Function Test Statistic 2133.722
Degrees of freedom 15
P-value 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 0.999
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -15213.274
Loglikelihood unrestricted model (H1) -15210.906
Number of free parameters 17
Akaike (AIC) 30460.548
Bayesian (BIC) 30542.783
Sample-size adjusted Bayesian (BIC) 30488.792
Root Mean Square Error of Approximation:
RMSEA 0.014
90 Percent Confidence Interval 0.000 0.053
P-value RMSEA <= 0.05 0.930
Standardized Root Mean Square Residual:
SRMR 0.007
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err Z-value P(>|z|)
ses =~
education 1.000
sei 5.219 0.422 12.364 0.000
alien67 =~
anomia67 1.000
powerless67 0.979 0.062 15.895 0.000
alien71 =~
anomia71 1.000
powerless71 0.922 0.059 15.498 0.000
Regressions:
Estimate Std.Err Z-value P(>|z|)
alien71 ~
alien67 (aa) 0.607 0.051 11.898 0.000
ses -0.227 0.052 -4.334 0.000
alien67 ~
ses (sa) -0.575 0.056 -10.195 0.000
Covariances:
Estimate Std.Err Z-value P(>|z|)
anomia67 ~~
anomia71 1.623 0.314 5.176 0.000
powerless67 ~~
powerless71 0.339 0.261 1.298 0.194
Variances:
Estimate Std.Err Z-value P(>|z|)
education 2.801 0.507 5.525 0.000
sei 264.597 18.126 14.597 0.000
anomia67 4.731 0.453 10.441 0.000
powerless67 2.563 0.403 6.359 0.000
anomia71 4.399 0.515 8.542 0.000
powerless71 3.070 0.434 7.070 0.000
ses 6.798 0.649 10.475 0.000
alien67 4.841 0.467 10.359 0.000
alien71 4.083 0.404 10.104 0.000
R-Square:
Estimate
education 0.708
sei 0.412
anomia67 0.600
powerless67 0.726
anomia71 0.649
powerless71 0.692
alien67 0.317
alien71 0.497
Defined Parameters:
Estimate Std.Err Z-value P(>|z|)
IndirectEffect -0.349 0.041 -8.538 0.000
Issues in SEM
Identification
Identification generally refers to the problem of finding a unique estimate of the value for each parameter in the model. Consider the following: \[ a + b = 2\]
There is no way for us to determine a unique solution for \(a\) and \(b\), e.g. the values of 1 and 1 work just as well as -1052 and 1054 and infinite other combinations. We can talk about 3 basic scenarios, and the problem generally regards how much information we have (in terms of (co)variances) vs. how many parameters we want to estimate in the model.
- A model which has an equal number of observations (again, in terms of (co)variances) and parameters to estimate would have zero degrees of freedom, and is known as a just identified model. In a just identified model there are no extra degrees of freedom leftover to test model fit.
- Underidentified models are models where it is not possible to find a unique estimate for each parameter. These models may have negative degrees of freedom or problematic model structures, as in the example above, and you’ll generally know right away there is a problem as whatever software package will note an error, warning, or not provide output.
- Overidentified models have positive degrees of freedom, meaning there are more than enough pieces of information to estimate each parameter. It is desirable to have overidentified models, as it allows us to use other measures of model fit.
Consider the following CFA example in which we try to estimate a latent variable model with only two observed variables, as would be the case in the prior Alienation measurement models if they are examined in isolation. We have only two variances and one covariance to estimate two paths, the latent variable variance and the two residual variances. By convention, a path is always fixed at 1 to scale the latent variable, but that still leaves us with four parameters to estimate with only three pieces of information, hence the -1 degrees of freedom and other issues in the output.
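The indicator data for this demonstration might have been generated along the following lines; the specific values behind the output below are not given, so this is only a sketch of three correlated indicators of a single underlying variable for 100 observations.
set.seed(123)
lv = rnorm(100)                  # the 'true' underlying variable
x1 = lv + rnorm(100, sd=.5)
x2 = .8*lv + rnorm(100, sd=.5)
x3 = 1.2*lv + rnorm(100, sd=.5)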
modelUnder = 'LV =~ x1 + x2'
modelJust = 'LV =~ x1 + x2 + x3'
underModel = cfa(modelUnder, data=cbind(x1,x2,x3))

lavaan (0.5-20) converged normally after 10 iterations
Number of observations 100
Estimator ML
Minimum Function Test Statistic NA
Degrees of freedom -1
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err Z-value P(>|z|)
LV =~
x1 1.000
x2 0.677 NA
Variances:
Estimate Std.Err Z-value P(>|z|)
x1 0.409 NA
x2 0.740 NA
LV 0.614 NA
Now if we had a third variable, we would have six pieces of information but seven unknowns. Again though, we usually fix the first loading to 1, so it is not actually estimated, leaving six parameters. An alternative approach would be to fix the factor variance to some value (typically 1, to create a standardized latent variable). Either way, this allows us to estimate a unique value for each remaining parameter.
Even so, this is the just identified situation, and so the model runs, but we won’t have any fit measures because we can perfectly reproduce the observed correlation matrix.
justModel = cfa(modelJust, data=cbind(x1,x2,x3))

summary(justModel, fit=T)
lavaan (0.5-20) converged normally after 21 iterations
Number of observations 100
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
Model test baseline model:
Minimum Function Test Statistic 90.742
Degrees of freedom 3
P-value 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -388.626
Loglikelihood unrestricted model (H1) -388.626
Number of free parameters 6
Akaike (AIC) 789.251
Bayesian (BIC) 804.882
Sample-size adjusted Bayesian (BIC) 785.933
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent Confidence Interval 0.000 0.000
P-value RMSEA <= 0.05 1.000
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err Z-value P(>|z|)
LV =~
x1 1.000
x2 1.412 0.281 5.021 0.000
x3 1.767 0.384 4.605 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
x1 0.728 0.113 6.427 0.000
x2 0.434 0.113 3.858 0.000
x3 0.214 0.151 1.422 0.155
LV 0.294 0.112 2.627 0.009
Note that in the full alienation model, we have 6*7/2 = 21 variances and covariances, which provides enough to estimate the parameters of the model.
Determining identification is difficult for any complex model. Necessary conditions include there being model degrees of freedom \(\geq 0\), and scaling all latent variables, but they are not sufficient. In general though, it is enough to know conceptually what the issue is and how the information you have relates to what you can estimate.
Fit
There are many, many measures of model fit for SEM, and none of them will give you a definitive answer as to how your model is doing. If you use them, take a holistic approach to get a global sense of how your model is doing. Let’s look again at the alienation results.
fitMeasures(alienation, c('chisq', 'df', 'pvalue', 'cfi', 'rmsea', 'srmr', 'AIC'))
chisq df pvalue cfi rmsea srmr aic
4.735 4.000 0.316 1.000 0.014 0.007 30460.548
Chi-square test
Conceptually the chi-square test measures the discrepancy between the observed correlations and those implied by the model. In the graphical model section, we actually gave an example of reproducing a correlation from a path analysis. In general, the model goal is to reproduce them as closely as we can. It compares the fitted model with a (saturated) model that does not have any degrees of freedom. The degrees of freedom for this test are equal to the data (variances + covariances) minus the number of parameters estimated. A non-significant \(\chi^2\) suggests our predictions are not statistically different from those we observe, so yay!
Estimator ML
Minimum Function Test Statistic 4.735
Degrees of freedom 4
P-value (Chi-square) 0.316
Or not. Those familiar with null-hypothesis testing know that one cannot accept a null hypothesis, and attempting to do so is fundamentally illogical. Other things that affect this measure specifically include multivariate non-normality, the size of the correlations (larger ones are typically related to larger predicted vs. observed discrepancies), unreliable measures (can actually make this test fail to reject), and sample size (same as with any other model scenario and statistical significance).
So if it worries you that a core measure of model fit in SEM is fundamentally problematic, good. As has been said before, no single measure will be good enough on its own, so gather as much info as you can. Some suggest paying more attention to the \(\chi^2\) result, but to me, the flawed logic is something that can’t really be overcome. If you use it with appropriate null hypothesis testing logic, a significant \(\chi^2\) test can tell you that something is potentially wrong with the model.
- Note that lavaan also provides a chi-square test which compares the current model to a baseline model in which all paths are zero, essentially akin to the likelihood ratio test we might use in standard model settings (e.g. comparing against an intercept-only model). For that test, we want a statistically significant result. One can also compare two nested models fit with lavaan directly, as in the brief sketch below.
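A minimal sketch of such a nested comparison, assuming two fitted lavaan objects fit0 (the restricted model) and fit1 (the fuller model):
lavTestLRT(fit0, fit1)   # likelihood ratio (chi-square difference) test
# anova(fit0, fit1)      # equivalent shortcut for lavaan objects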
CFI etc.
The comparative fit index compares the fitted model to a null model that assumes there is no relationship among the measured items. CFI values larger than .9 or especially .95 are desired. Others include the Tucker-Lewis Fit Index, which is provided in standard lavaan output, but there are more incremental fit indices where those come from.
User model versus baseline model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 0.999
RMSEA
The root mean squared error of approximation is a measure that also centers on the model-implied vs. sample covariance matrix, and, all else being equal, is lower for simpler models and larger sample sizes. Look for values less than .05. Lavaan also provides a one-sided test that the RMSEA is \(\leq .05\), for which the p-value would ideally be high, but the confidence interval is enough for reporting purposes.
Root Mean Square Error of Approximation:
RMSEA 0.014
90 Percent Confidence Interval 0.000 0.053
P-value RMSEA <= 0.05 0.930
Standardized Root Mean Square Residual:
SRMR 0.007
SRMR
The standardized root mean squared residual is the mean absolute correlation residual, i.e. the difference between the observed and model-implied correlations. Historical suggestions are to also look for values less than .05, but it is better to simply inspect the residuals and note where there are large discrepancies.
residuals(alienation, type='cor')$cor
eductn sei anom67 pwrl67 anom71 pwrl71
education 0.000
sei 0.000 0.000
anomia67 0.007 -0.020 0.000
powerless67 -0.006 0.018 0.000 0.000
anomia71 0.007 -0.017 0.000 0.001 0.000
powerless71 -0.001 0.001 -0.001 0.000 0.000 0.000
Fit Summarized
A brief summary of these and other old/typical measures of fit is described here. However, they all have issues, and one should never use cutoffs as the basis for one’s ultimate thinking about model performance. Studies have been done, and all the fit indices have problems in various scenarios, while the cutoffs do not hold up under scrutiny. They can provide some assistance in the process, but they are no substitute for a global assessment of theory-result compatibility.
Model Comparison
All of the above, while rooted in model comparison approaches, are by themselves only providing information about the fit or lack thereof regarding the current model. In any modeling situation, SEM or otherwise, a model comparison approach is to be preferred. Even when we don’t have the greatest model, being able to choose among viable options can help science progress.
AIC
AIC is a good way to compare models in SEM just as it is elsewhere, where a penalty is imposed on the likelihood based on the number of parameters estimated, i.e. model complexity. The value by itself is not useful, but the ‘better’ model will have a lower value. A natural question arises as to how low is low enough to prefer one model over another, but this is impossible to answer because the value of AIC varies greatly with the data and model in question. However, this frees you to know which is ‘better’, at least in terms of AIC, while still allowing you to consider the merits of the model even so. If the lower AIC is associated with the simpler model, you’d be hard-pressed to justify not taking it.
BIC
One can probably ignore the BIC in this context. This isn’t actually Bayesian, and if you are using a Bayesian approach, WAIC or DIC would be appropriate. If you aren’t using a Bayesian approach, then AIC would likely be preferable in most circumstances. The BIC has a different penalty than AIC, and is not a measure based on predictive performance, which is what we typically want in model selection.
Example
Let’s compare the previous model to one without the indirect effect, in which the SES and alienation contributions are independent (i.e. the previous code changed to alien67 ~ 0*ses, as sketched below). We’ll use the semTools package for easy side-by-side comparison.
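A sketch of how the competing model might be set up, using the same (hypothetical) wheaton data object as before:
wheaton.model.noInd <- '
# measurement model
ses =~ education + sei
alien67 =~ anomia67 + powerless67
alien71 =~ anomia71 + powerless71
# structural model, with the ses -> alien67 path fixed to zero
alien71 ~ alien67 + ses
alien67 ~ 0*ses
# correlated residuals
anomia67 ~~ anomia71
powerless67 ~~ powerless71
'
alienationNoInd = sem(wheaton.model.noInd, data=wheaton)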
library(semTools)
## compareFit(alienation, alienationNoInd)
################### Nested Model Comparison #########################
                                 chi  df      p  delta.cfi
alienation - alienationNoInd  200.28   1  <.001     0.0941

                   chisq  df  pvalue    cfi   tli        aic        bic  rmsea   srmr
alienation         4.735   4    .316  1.000  .999  30460.548  30542.783   .014   .007
alienationNoInd  205.017   5    .000   .906  .717  30658.830  30736.227   .207   .173
The first result is a likelihood ratio test. The model with no path is nested within the model with the path, so this is a viable option. It tells us, essentially, that adding the indirect path results in a statistically significantly better model. In terms of fit indices, the model including the indirect effect is also better. So now we can say not only that our model appears to fit the data well, but also that it is better than a plausible competitor.
Prediction
While the fitted correlation matrix is nice to be able to obtain, it has always struck me as a bit odd that one can’t even predict the actual data with typical SEM software. Part of this is because the models regard the covariance matrix as opposed to the raw data, and that is the focus in many SEM situations. But in path analysis, measurement models, and SEM where mean structures are of interest (e.g. growth curves), it stands to reason that one would like to get predicted values and/or be able to test a model on new data. Even in more complex models, predictions can be made by fixing parameters at estimated values and supplying new data.
Lavaan at least does do this for you: its lavPredict function allows one to get predicted values for both latent and observed variables, for the current or new data. There is also a fit index obtainable through the fitMeasures function, the Expected Cross-Validation Index (ECVI), that speaks to this notion. In addition, the semTools package is a great resource for comparing models generally, comparing models across groups, model simulation, and so forth.
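A minimal sketch of what this looks like for the alienation model, assuming it was fit to raw data; newWheaton is a hypothetical data frame of new observations.
head(lavPredict(alienation))                   # factor scores for the latent variables
head(lavPredict(alienation, type='ov'))        # model-implied values for the observed variables
# lavPredict(alienation, newdata=newWheaton)   # the same for new data
fitMeasures(alienation, 'ecvi')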
Observed covariates
Just to be clear, SEM doesn’t only have to be about structural relations among latent variables. At any point observed covariates can be introduced to the structural model as well, and this is common practice. As an example, the alienation model is fundamentally wrong, as it doesn’t include many background or other characteristics we’d commonly collect on individuals and which might influence feelings of alienation.
Interactions
Interactions among both observed and latent variables can be included in SEM, and have the same interpretation as they would in any regression model. A common term for this in SEM is moderation. While many depictions in SEM suggest that one variable moderates another, just as with standard interactions it is arbitrary whether one says A interacts with/moderates B or vice versa, and this fact doesn’t change just because we are doing SEM. For latent variables, one can think of adding a latent variable whose indicator variables consist of product terms of the indicators for the latent variables we want to have an interaction. See indProd and probe2WayMC in the semTools package.
Estimation
In everything demonstrated thus far, we have been using standard maximum likelihood to estimate the parameters. This often may not be the best choice. The following list comes from the Mplus manual, and most of these are available in lavaan (a brief example of specifying an estimator follows the list).
- ML: maximum likelihood parameter estimates with conventional standard errors and chi-square test statistic
- MLM: maximum likelihood parameter estimates with standard errors and a mean-adjusted chi-square test statistic that are robust to non-normality. The chi-square test statistic is also referred to as the Satorra-Bentler chi-square.
- MLMV: maximum likelihood parameter estimates with standard errors and a mean- and variance-adjusted chi-square test statistic that are robust to non-normality
- MLR: maximum likelihood parameter estimates with standard errors and a chi-square test statistic (when applicable) that are robust to non-normality and non-independence of observations when used with TYPE=COMPLEX. The MLR standard errors are computed using a sandwich estimator. The MLR chi-square test statistic is asymptotically equivalent to the Yuan-Bentler T2* test statistic.
- MLF: maximum likelihood parameter estimates with standard errors approximated by first-order derivatives and a conventional chi-square test statistic
- MUML: Muthén’s limited information parameter estimates, standard errors, and chi-square test statistic
- WLS: weighted least square parameter estimates with conventional standard errors and chi-square test statistic that use a full weight matrix. The WLS chi-square test statistic is also referred to as ADF when all outcome variables are continuous.
- WLSM: weighted least square parameter estimates using a diagonal weight matrix with standard errors and mean-adjusted chi-square test statistic that use a full weight matrix
- WLSMV: weighted least square parameter estimates using a diagonal weight matrix with standard errors and mean- and variance-adjusted chi-square test statistic that use a full weight matrix
- ULS: unweighted least squares parameter estimates
- ULSMV: unweighted least squares parameter estimates with standard errors and a mean- and variance-adjusted chi-square test statistic that use a full weight matrix
- GLS: generalized least square parameter estimates with conventional standard errors and chi-square test statistic that use a normal-theory based weight matrix
- Bayes: Bayesian posterior parameter estimates with credibility intervals and posterior predictive checking
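In lavaan, switching estimators is just an argument to the fitting function. For example, a robust maximum likelihood fit of the earlier model might look like the following, with wheaton again standing in for the data object.
alienationRobust = sem(wheaton.model, data=wheaton, estimator='MLR')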
Missing data
A lot of data of interest in applications of SEM have missing values. Two common approaches to dealing with this are Full Information Maximum Likelihood (FIML) and Multiple Imputation (MI), and both are generally available in SEM packages. This is far too detailed an issue to treat adequately here, though we can take a moment to describe the approaches generally. FIML uses all the information available in the data, evaluating the likelihood case by case based on whichever variables are observed for a given observation. MI uses a process to estimate the missing raw data values, and to adequately account for the uncertainty in those guesses, it creates multiple versions of complete data sets, each with different estimates of the missing values. The SEM model is run on all of them and the estimates are combined across models (e.g. the mean path parameter). The imputation models, i.e. those used to estimate the missing values, can be any sort of regression model, including ones using variables not in the SEM model.
In addition, Bayesian approaches can estimate the missing values as additional parameters in the model (in fact, MI is essentially steeped in a Bayesian perspective). There may also be additional concerns when data are missing over time, i.e. longitudinal dropout. The lavaan package is nice because it comes with FIML, and the semTools package adds MI.
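In lavaan, FIML is likewise a single argument; a sketch, again with the hypothetical wheaton data, this time assumed to contain missing values.
alienationFIML = sem(wheaton.model, data=wheaton, missing='fiml')
# for multiple imputation, semTools provides wrappers (e.g. runMI) that fit the
# model to each imputed data set and pool the results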
Other SEM approaches
SEM is very flexible and applicable to a wide variety of modeling situations. Some of these will be covered in their own module (e.g. mixture models, growth curve modeling).
How to fool yourself with SEM
Kline’s third edition text listed over 50(!) ways in which one can fool oneself with SEM, which speaks to the difficulty of running SEM and dealing with all of its issues. I will note a handful of them to keep in mind in particular.
Sample size
If you don’t have at least a thousand observations, you will probably only be able to conduct (possibly unrealistically) simple SEM models, or just the measurement models for scale development, or only structural models with observed variables (path analysis). Even with simpler modeling situations, one should have several hundred observations. In the simple alienation model above, we already are dealing with 17 parameters, and it doesn’t include any background covariates of the individuals, which is unrealistic. Furthermore, because it’s a mediation model, adding such variables might require additional direct and indirect paths, time-varying covariates that would have effects at both years, etc., and the number of parameters could balloon quite quickly.
One will see many published articles using SEM with low sample sizes. This doesn’t make it correct to do so, and one should be highly suspicious of the results in those papers, as they are likely overfit or omit relevant information.
Poor data
If the correlations among the data are low, one isn’t going to magically have strong effects by using SEM. I have seen many clients run these models and be surprised when they don’t turn out well, when a quick glance at the correlation matrix would have suggested there wasn’t much to work with in the first place.
Naming a latent variable doesn’t mean it exists
While everything may turn out well for one’s measurement model, with results in keeping with theory, that doesn’t make it so. This is especially the case with less reliable measures. Latent constructs require operational definitions and other considerations in order to be useful, to rule out that one is simply measuring something else, and to establish that it makes sense for such a construct to have real (e.g. physical) ties.
As an example, many diagnoses in the Diagnostic and Statistical Manual of Mental Disorders have not even been shown to exist via a statistical approach like SEM, while others are simply assumed to exist, and even presumably (subsequently) supported by measurement models (often with low N), only to be shown to have no ties to any underlying physiology.
Ignoring diagnostics
Ignoring residuals, warning messages, etc. is a sure path to trying to interpret nonsensical results. Look at your residuals, fitted values etc. If your SEM software of choice is giving you messages, find out what they mean, because it may be very important.
Summary
SEM is a powerful modeling approach that generalizes many other techniques, but it simply cannot be used lightly. Strong theory, strong data, and a lot of it can potentially result in quite interesting models that have a lot to say about the underlying constructs of interest. Go into it with competing ideas, and realize that your theory is wrong from the outset, even if there is evidence that it isn’t way off.
Latent Growth Curves
Latent growth curve (LGC) models are, in a sense, just a different form of the very commonly used mixed model framework. In some ways they are more flexible, mostly in that the structural equation modeling framework allows for indirect effects and other complex covariate relationships. In other ways they are less flexible, e.g. they require balanced data, and have a harder time with nonlinear relationships, many time points, and time-varying covariates. With appropriate tools there is little one can’t do with the normal mixed model approach relative to the SEM approach, and one would likely have easier interpretation. As such, I’d recommend sticking with the standard mixed model framework unless you really need what the SEM approach offers.
That said, growth curve models are a very popular SEM technique, so it makes sense to become familiar with them. To best understand a growth curve model, I still think it’s instructive to see it from the mixed model perspective, where things are mostly interpretable from what you know from a SLiM.
Random effects
Often data is clustered, e.g. students within schools or observations for individuals over time. The standard linear model assumes independent observations, and in these situations we definitely do not have that.
One very popular way to deal with these are a class of models called mixed effects models, or simply mixed models. They are mixed, because there is a mixture of fixed effects and random effects. The fixed effects are the regression coefficients one has in standard modeling approaches.
The random effects allow each cluster to have its own unique effect in addition to the overall fixed effect. This is simply a random deviation, almost always normally distributed in practice, from the overall intercept and slopes. Mixed models are a balanced approach between ignoring these unique contributions, and over-contextualizing by running separate regressions for every cluster.
Random Effects in SEM
As we’ve seen with other models, the latent variables are assumed normally distributed, usually with zero mean, and some estimated variance. Well so are the random effects in mixed models, and it’s through this that we can maybe start to get a sense of random effects as latent variables (or vice versa). Indeed, mixed models have ties to many other kinds of models (e.g. spatial, additive), because they too add a ‘random’ component to the model in some fashion.
Simulating Random Effects
Through simulation we can demonstrate conceptual understanding, and will be well on our way toward better understanding LGC models. We’ll have balanced data with four time-points for 500 individuals (subjects).
set.seed(1234)
n = 500
timepoints = 4
time = rep(0:3, times=n)
subject = rep(1:n, each=4)
We’ll have ‘fixed’ effects, i.e. our standard regression intercept and slope, set at .5 and .25 respectively. We’ll allow their associated subject-specific effects to have a slight correlation (.2), and as such we’ll draw them from a multivariate normal distribution (variance of 1 for both effects).
library(dplyr)  # loaded for the pipe operator used below
intercept = .5
slope = .25
randomEffectsCorr = matrix(c(1,.2,.2,1), ncol=2)
randomEffects = MASS::mvrnorm(n, mu=c(0,0), Sigma=randomEffectsCorr, empirical=T) %>%
  data.frame()
colnames(randomEffects) = c('Int', 'Slope')
Let’s take a look at the data thus far. Note how I’m using subject as a row index. This will spread out the n random effects to n*timepoints total, while being constant within a subject.
data.frame(Subject=subject, time=time, randomEffects[subject,]) %>%
head(20)
Subject time Int Slope
1 1 0 -1.4774750 0.4536349
1.1 1 1 -1.4774750 0.4536349
1.2 1 2 -1.4774750 0.4536349
1.3 1 3 -1.4774750 0.4536349
2 2 0 0.6390051 -0.9525007
2.1 2 1 0.6390051 -0.9525007
2.2 2 2 0.6390051 -0.9525007
2.3 2 3 0.6390051 -0.9525007
3 3 0 0.7736171 1.1377251
3.1 3 1 0.7736171 1.1377251
3.2 3 2 0.7736171 1.1377251
3.3 3 3 0.7736171 1.1377251
4 4 0 -2.1972780 -1.0066772
4.1 4 1 -2.1972780 -1.0066772
4.2 4 2 -2.1972780 -1.0066772
4.3 4 3 -2.1972780 -1.0066772
5 5 0 -0.1922775 1.8472234
5.1 5 1 -0.1922775 1.8472234
5.2 5 2 -0.1922775 1.8472234
5.3 5 3 -0.1922775 1.8472234
Now, to get a target variable, we simply add the random effects for the intercept to the overall intercept, and likewise for the slopes. We’ll throw in some noise at the end.
sigma = .5
y1 = (intercept + randomEffects$Int[subject]) + (slope + randomEffects$Slope[subject])*time + rnorm(n*timepoints, mean=0, sd=sigma)
d = data.frame(subject, time, y1)
head(d)
subject time y1
1 1 0 -1.5801417
2 1 1 -0.1231067
3 1 2 -0.3397778
4 1 3 1.4511151
5 2 0 1.4904810
6 2 1 -0.5164371
Let’s estimate this as a mixed model first. See if you can match the parameters from our simulated data to the output.
library(lme4)
mixedModel = lmer(y1 ~ time + (1 + time|subject), data=d) # 1 represents the intercept
## summary(mixedModel)
Linear mixed model fit by REML ['lmerMod']
Formula: y1 ~ time + (1 + time | subject)
Data: d
REML criterion at convergence: 5833.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.36211 -0.48276 0.02046 0.47515 2.84524
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 0.9999 0.9999
time 0.9898 0.9949 0.21
Residual 0.2382 0.4881
Number of obs: 2000, groups: subject, 500
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.48986 0.04830 10.141
time 0.26400 0.04555 5.796
Our fixed effects are the values we set for the overall intercept and slope. The estimated random effects variances are at 1, the correlation near .2, and finally, our residual standard deviation is near the .5 value we set.
Running a Growth Curve Model
As before, we’ll use lavaan, but now the syntax will look a bit strange compared to what we’re used to, because we have to fix the factor loadings to specific values in order to make it work. This also leads to non-standard output, as there is nothing to estimate for the many fixed parameters. Furthermore, our data needs to be in wide format, as opposed to the long format we used in the previous mixed model.
subject time y1
1 1 0 -1.5801417
2 1 1 -0.1231067
3 1 2 -0.3397778
4 1 3 1.4511151
5 2 0 1.4904810
6 2 1 -0.5164371
library(tidyr)
dWide = d %>%
spread(time, y1)
# change the names, as usually things don't work well if they start with a number
colnames(dWide)[-1] = paste0('y', 0:3)
model = "
# intercept and slope with fixed coefficients
i =~ 1*y0 + 1*y1 + 1*y2 + 1*y3
s =~ 0*y0 + 1*y1 + 2*y2 + 3*y3
"
growthCurveModel = growth(model, data=dWide)
summary(growthCurveModel)
lavaan (0.5-20) converged normally after 42 iterations
Number of observations 500
Estimator ML
Minimum Function Test Statistic 10.616
Degrees of freedom 5
P-value (Chi-square) 0.060
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err Z-value P(>|z|)
i =~
y0 1.000
y1 1.000
y2 1.000
y3 1.000
s =~
y0 0.000
y1 1.000
y2 2.000
y3 3.000
Covariances:
Estimate Std.Err Z-value P(>|z|)
i ~~
s 0.226 0.050 4.512 0.000
Intercepts:
Estimate Std.Err Z-value P(>|z|)
y0 0.000
y1 0.000
y2 0.000
y3 0.000
i 0.487 0.048 10.072 0.000
s 0.267 0.045 5.884 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
y0 0.287 0.041 6.924 0.000
y1 0.219 0.021 10.501 0.000
y2 0.185 0.027 6.748 0.000
y3 0.357 0.065 5.485 0.000
i 0.977 0.076 12.882 0.000
s 0.969 0.065 14.841 0.000
Most of the output is blank, which is needless clutter. There are only the same five values we are interested in though.
Start with the intercepts:
Intercepts:
Estimate Std.Err Z-value P(>|z|)
i 0.487 0.048 10.072 0.000
s 0.267 0.045 5.884 0.000
It might be odd to call your fixed effects ‘intercepts’, but it makes sense if we are thinking of it as a multilevel model as depicted previously, where we actually broke out the random effects as a separate model. The estimates here are pretty much spot on with our mixed model estimates, which are in turn identical to the standard regression estimates.
(Intercept) time
0.4898598 0.2640034
Call:
lm(formula = y1 ~ time, data = d)
Coefficients:
(Intercept) time
0.4899 0.2640
Now let’s look at the variance estimates. The estimation of a separate residual variance for each y is what distinguishes the LGC from the mixed model here, but it doesn’t have to. We could constrain them to be equal here, or allow them to be estimated separately in the mixed model framework. Just know that’s why the results are not identical (along with their respective estimation approaches, which are also different by default). Again though, the variances are near one, and the correlation between the intercepts and slopes is around the .2 value.
Covariances:
Estimate Std.Err Z-value P(>|z|)
i ~~
s 0.226 0.050 4.512 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
y0 0.287 0.041 6.924 0.000
y1 0.219 0.021 10.501 0.000
y2 0.185 0.027 6.748 0.000
y3 0.357 0.065 5.485 0.000
i 0.977 0.076 12.882 0.000
s 0.969 0.065 14.841 0.000
Groups Name Std.Dev. Corr
subject (Intercept) 0.99994
time 0.99488 0.208
Residual 0.48806
The differences provide some insight. LGC by default assumes heterogeneous variance for each time point. Mixed models by default assume the same \(\sigma^2\) for each time point, but can allow them to be estimated separately in most modeling packages.
As an example, if we fix the variances to be equal, the models are now identical.
model = "
# intercept and slope with fixed coefficients
i =~ 1*y0 + 1*y1 + 1*y2 + 1*y3
s =~ 0*y0 + 1*y1 + 2*y2 + 3*y3
y0 ~~ resvar*y0
y1 ~~ resvar*y1
y2 ~~ resvar*y2
y3 ~~ resvar*y3
"
growthCurveModel = growth(model, data=dWide)
summary(growthCurveModel)
summary(mixedModel, corr=F)
lavaan (0.5-20) converged normally after 27 iterations
Number of observations 500
Estimator ML
Minimum Function Test Statistic 17.105
Degrees of freedom 8
P-value (Chi-square) 0.029
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err Z-value P(>|z|)
i =~
y0 1.000
y1 1.000
y2 1.000
y3 1.000
s =~
y0 0.000
y1 1.000
y2 2.000
y3 3.000
Covariances:
Estimate Std.Err Z-value P(>|z|)
i ~~
s 0.207 0.050 4.170 0.000
Intercepts:
Estimate Std.Err Z-value P(>|z|)
y0 0.000
y1 0.000
y2 0.000
y3 0.000
i 0.490 0.048 10.151 0.000
s 0.264 0.046 5.802 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
y0 (rsvr) 0.238 0.011 22.361 0.000
y1 (rsvr) 0.238 0.011 22.361 0.000
y2 (rsvr) 0.238 0.011 22.361 0.000
y3 (rsvr) 0.238 0.011 22.361 0.000
i 0.998 0.074 13.478 0.000
s 0.988 0.066 15.076 0.000
Linear mixed model fit by REML ['lmerMod']
Formula: y1 ~ time + (1 + time | subject)
Data: d
REML criterion at convergence: 5833.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.36211 -0.48276 0.02046 0.47515 2.84524
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 0.9999 0.9999
time 0.9898 0.9949 0.21
Residual 0.2382 0.4881
Number of obs: 2000, groups: subject, 500
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.48986 0.04830 10.141
time 0.26400 0.04555 5.796
In addition, the random coefficients estimates from the mixed model perfectly correlate with those of the latent variables.
ranefLatent = data.frame(coef(mixedModel)[[1]], lavPredict(growthCurveModel))
ranefLatent %>% round(2) %>% head
X.Intercept. time i s
1 -1.12 0.72 -1.12 0.72
2 0.90 -0.61 0.90 -0.61
3 1.08 1.72 1.08 1.72
4 -1.86 -0.68 -1.86 -0.68
5 0.06 1.94 0.06 1.94
6 0.87 0.24 0.87 0.24
ranefLatent %>% cor %>% round(2)
X.Intercept. time i s
X.Intercept. 1.00 0.29 1.00 0.29
time 0.29 1.00 0.29 1.00
i 1.00 0.29 1.00 0.29
s 0.29 1.00 0.29 1.00
Both approaches allow those residuals to covary, though it gets tedious in SEM syntax, while it is a natural extension in the mixed model framework. Here is the syntax for letting each time point covary with the next.
model = "
# intercept and slope with fixed coefficients
i =~ 1*y0 + 1*y1 + 1*y2 + 1*y3
s =~ 0*y0 + 1*y1 + 2*y2 + 3*y3
# all of the following is needed for what are essentially only two parameters
# to estimate- resvar and correlation (the latter defined explicitly here)
y0 ~~ resvar*y0
y1 ~~ resvar*y1
y2 ~~ resvar*y2
y3 ~~ resvar*y3
# timepoints 1 step apart
y0 ~~ a*y1
y1 ~~ a*y2
y2 ~~ a*y3
# two steps apart
y0 ~~ b*y2
y1 ~~ b*y3
# three steps apart
y0 ~~ c*y3
# fix parameters according to ar1
b == a^2
c == a^3
"
lavaan (0.5-20) converged normally after 287 iterations
Number of observations 500
Estimator ML
Minimum Function Test Statistic 14.881
Degrees of freedom 7
P-value (Chi-square) 0.038
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
i =~
y0 1.000 0.980 0.884
y1 1.000 0.980 0.603
y2 1.000 0.980 0.399
y3 1.000 0.980 0.291
s =~
y0 0.000 0.000 0.000
y1 1.000 0.991 0.610
y2 2.000 1.983 0.808
y3 3.000 2.974 0.882
Covariances:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
y0 ~~
y1 (a) 0.029 0.021 1.397 0.162 0.029 0.110
y1 ~~
y2 (a) 0.029 0.021 1.397 0.162 0.029 0.110
y2 ~~
y3 (a) 0.029 0.021 1.397 0.162 0.029 0.110
y0 ~~
y2 (b) 0.001 0.001 0.698 0.485 0.001 0.003
y1 ~~
y3 (b) 0.001 0.001 0.698 0.485 0.001 0.003
y0 ~~
y3 (c) 0.000 0.000 0.466 0.641 0.000 0.000
i ~~
s 0.217 0.051 4.293 0.000 0.223 0.223
Intercepts:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
y0 0.000 0.000 0.000
y1 0.000 0.000 0.000
y2 0.000 0.000 0.000
y3 0.000 0.000 0.000
i 0.492 0.048 10.195 0.000 0.502 0.502
s 0.263 0.046 5.765 0.000 0.265 0.265
Variances:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
y0 (rsvr) 0.267 0.025 10.714 0.000 0.267 0.218
y1 (rsvr) 0.267 0.025 10.714 0.000 0.267 0.101
y2 (rsvr) 0.267 0.025 10.714 0.000 0.267 0.044
y3 (rsvr) 0.267 0.025 10.714 0.000 0.267 0.024
i 0.960 0.079 12.155 0.000 1.000 1.000
s 0.983 0.066 14.883 0.000 1.000 1.000
Constraints:
|Slack|
b - (a^2) 0.000
c - (a^3) 0.000
Linear mixed-effects model fit by maximum likelihood
Data: d
AIC BIC logLik
5836.589 5875.795 -2911.295
Random effects:
Formula: ~1 + time | subject
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.9768961 (Intr)
time 0.9907609 0.225
Residual 0.5215351
Correlation Structure: AR(1)
Formula: ~time | subject
Parameter estimate(s):
Phi
0.1235872
Fixed effects: y1 ~ time
Value Std.Error DF t-value p-value
(Intercept) 0.4918555 0.04827394 1499 10.188841 0
time 0.2628092 0.04559937 1499 5.763439 0
Correlation:
(Intr)
time 0.121
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.29559946 -0.45932855 0.01075219 0.45334542 2.76051403
Number of Observations: 2000
Number of Groups: 500
Thinking more generally about regression
In fact, your standard regression is already equipped to handle heterogeneous variances and a specific correlation structure for the residuals. In reality the linear model is the following model:
\[y \sim N(X\beta, \Sigma)\]
\(X\beta\) represents the linear predictor, i.e. the linear combination of your predictors, and \(\Sigma\) is a big N by N covariance matrix. Thus the target variable \(y\) is multivariate normal with mean vector \(X\beta\) and covariance \(\Sigma\).
SLiMs assume that the covariance matrix is constant diagonal. A single value on the diagonal, \(\sigma^2\), and zeros on the off-diagonals. Mixed models, however, can allow the covariance structure to be specified in myriad ways, and it ties them to still other models, which in the end produces a very flexible modeling framework.
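As a sketch of that idea in a ‘standard’ regression setting, generalized least squares (e.g. via nlme) allows heterogeneous variances per time point and an AR(1) residual correlation within subject, here using the simulated data from before.
library(nlme)
glsModel = gls(y1 ~ time, data = d,
               weights = varIdent(form = ~ 1 | time),
               correlation = corAR1(form = ~ time | subject))
summary(glsModel)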
More on LGC
LGC are non-standard SEM
In no other SEM situation do you fix so many parameters or think about your latent variables in this manner. This can make for difficult interpretations relative to the mixed model (unless you are aware of the parallels).
Residual correlations
Typical models that would be investigated via the LGC have correlated residuals as depicted above.
Nonlinear time effect
A nonlinear time effect can be estimated if we don’t fix all the parameters for the slope factor. As an example, the following would actually estimate the loadings for times in between the first and last point.
s =~ 0*y0 + y1 + y2 + 1*y3
It may be difficult to assess nonlinear relationships unless one has many time points, and even then, one might get more with an additive mixed model approach.
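For completeness, a sketch of the full ‘latent basis’ version of the earlier growth model, extending the snippet above so the middle loadings are freely estimated.
modelBasis = "
i =~ 1*y0 + 1*y1 + 1*y2 + 1*y3
s =~ 0*y0 + y1 + y2 + 1*y3
"
basisFit = growth(modelBasis, data=dWide)
summary(basisFit)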
Growth Mixture Models
Adding a latent categorical variable would allow for different trajectories across the latent groups. Most clients that I’ve seen typically did not have enough data to support it, as one is essentially estimating a whole growth model for each group. Some might restrict certain parameters for certain groups, but given that the classes are a latent construct to be discovered, there would be no theoretical justification to do so, and it would only complicate interpretation at best. Researchers rarely if ever predict test data, nor provide evidence that the clusters hold up with alternate data. In addition, it seems that typical interpretation of the classes takes on an ordered structure (e.g. low, medium, and high), which means they just have a coarsely measured continuous latent variable. Had they started under the assumption of a continuous latent variable, it might have made things easier to interpret and estimate.
As of this writing, Mplus is perhaps the SEM software most commonly used for this, and it requires yet another syntax style, and, depending on the model you run, some of the most confusing output you’ll ever see in SEM. Alternatives in R include flexmix (demonstrated in the Mixture Models module) for standard mixture modeling (including mixed effects models), as well as the R package OpenMx.
Other covariates
Cluster level
To add a cluster level covariate, for a mixed model, it looks something like this:
standard random intercept model \[y = b_{0c} + b_1*time + e \] \[b_{0c} = b_0 + u_c\]
Plugging the second equation into the first becomes: \[y = b_0 + b_1*time + u_c + e \]
subject level covariate added \[b_{0c} = b_0 + b_2*sex + u_c\]
But if we plug that into our level 1 model, it just becomes: \[y = b_0 + b_2*sex + b_1*time + u_c + e\]
In our previous modeling syntax it would look like this:
mixedModel = lmer(y1 ~ sex + time + (time|subject), data=d)
We’d have a fixed effect for sex and interpret it just like in the standard setting. Similarly, if we had a time-varying covariate, say socioeconomic status, it’d look like the following:
mixedModel = lmer(y1 ~ time + ses + (time|subject), data=d)
Though we could have a random slope for SES if we wanted. You get the picture. Most of the model is still standard regression interpretation.
With LGC, there is a tendency to interpret the model as an SEM, and certainly one can. But adding additional covariates typically causes confusion for those not familiar with mixed models. We literally do have to regress the intercept and slope latent variables on cluster level covariates as follows.
model.syntax <- '
# intercept and slope with fixed coefficients
i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
# regressions
i ~ x1 + x2
s ~ x1 + x2
'
Applied researchers commonly have difficulty interpreting the model due to past experience with SEM. While these are latent variables, they aren’t just latent variables or underlying constructs. It doesn’t help that the output can be confusing, because now one has an ‘intercept for your intercepts’ and an ‘intercept for your slopes’. In the multilevel context this makes sense, because there you know ‘intercept’ just means ‘fixed effect’.
Time-varying covariates
With time varying covariates, the syntax starts to get tedious.
model.syntax <- '
# intercept and slope with fixed coefficients
i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
# regressions
i ~ x1 + x2
s ~ x1 + x2
# time-varying covariates
y1 ~ c1
y2 ~ c2
y3 ~ c3
y4 ~ c4
'
fit <- growth(model.syntax, data=Demo.growth)
summary(fit)
Now imagine having just a few of those kinds of variables. In the mixed model framework one would add them in as any covariate in a regression model. In the LGC framework, one has to regress each time point for the target variable on its corresponding predictor time point. It might take a few paragraphs to explain the coefficients for just a handful of covariates.
Some Differences between Mixed Models and Growth Curves
Random slopes
One difference seen in comparing LGC models vs. mixed models is that in the former, random slopes are always assumed, whereas in the latter, one would typically see if it’s worth adding random slopes in the first place, or simply not assume them.
Wide vs. long
The SEM framework is inherently multivariate, and your data will need to be in wide format. This isn’t too big of a deal until you have many time-varying covariates, at which point the model syntax gets tedious and the number of parameters to estimate climbs rapidly. God help you if you want to investigate interactions based on time-varying covariates.
Sample size
As we have noted before, SEM is inherently a large sample technique. The mixed model does not require as much for standard approaches, but may require a lot more depending on the model one tries to estimate.
Number of time points
Mixed models can run even if some clusters have a single value. SEM requires balanced data and so one will always have to estimate missing values or drop them. Whether this missingness can be ignored in the standard mixed model framework is a matter of some debate in certain circles.
Time points
Numbering your time from zero makes sense in both worlds. This leads to the natural interpretation that the intercept is the mean for your first timepoint.
Other stuff
Here is an example of a parallel process in which we posit two growth curves at the same time, with possible correlations among them. This could be accomplished quite easily with a standard mixed model in the Bayesian framework, with a multivariate response, though I’ll have to come back to that later.
# parallel process --------------------------------------------------------
# let's simulate data with a negative slope average and positive correlation among intercepts and other process slopes
set.seed(1234)
n = 500
timepoints = 4
time = rep(0:3, times=n)
subject = rep(1:n, each=4)
# first we'll draw intercepts with overall mean .5 and -.5 for the two
# processes, and let them have a slight correlation. Their variance is 1.
intCorr = matrix(c(1,.2,.2,1), ncol=2)
colnames(intCorr) = rownames(intCorr) = c('i1', 'i2')
intCorr
interceptP1 = .5
interceptP2 = -.5
ranInts = MASS::mvrnorm(n, mu=c(0,0), Sigma = intCorr, empirical=T)
ranInts = data.frame(ranInts)
head(ranInts)
cor(ranInts)
colMeans(ranInts)
# now create slope deviations with a slight positive relationship to the other
# process's intercept; the slope means (.4, -.4) are added below when creating y
slopeP1 = .4
slopeP2 = -.4
s1 = .3*ranInts$i2 + rnorm(n)
s2 = .3*ranInts$i1 + rnorm(n)
ranef = data.frame(ranInts, s1, s2)
head(ranef)
# so we have slight positive correlations among all random intercepts and slopes
y1 = (interceptP1 + ranef$i1[subject]) + (slopeP1+ranef$s1[subject])*time + rnorm(n*timepoints, sd=.5)
d1 = data.frame(Subject=subject, time=time, y1)
head(d1)
library(ggplot2)
ggplot(aes(x=time, y=y1), data=d1) +
geom_line(aes(group=Subject), alpha=.1) +
geom_smooth(method='lm',color='red') +
lazerhawk::theme_trueMinimal()
y2 = (interceptP2 + ranef$i2[subject]) + (slopeP2+ranef$s2[subject])*time + rnorm(n*timepoints, sd=.5)
d2 = data.frame(Subject=subject, time=time, y2)
# process 2 shows the downward overall trend as expected
ggplot(aes(x=time, y=y2), data=d2) +
geom_line(aes(group=Subject), alpha=.1) +
geom_smooth(method='lm',color='red') +
lazerhawk::theme_trueMinimal()
# Widen from long form for lavaan
library(tidyr)
negslopepospath1 = d1 %>% spread(time, y1)
colnames(negslopepospath1) = c('Subject', paste0('y1', 1:4))
head(negslopepospath1)
negslopepospath2 = d2 %>% spread(time, y2)
colnames(negslopepospath2) = c('Subject', paste0('y2', 1:4))
# combine
dparallel = dplyr::left_join(negslopepospath1, negslopepospath2)
head(dparallel)
mainModel = "
i1 =~ 1*y11 + 1*y12 + 1*y13 + 1*y14
s1 =~ 0*y11 + 1*y12 + 2*y13 + 3*y14
i2 =~ 1*y21 + 1*y22 + 1*y23 + 1*y24
s2 =~ 0*y21 + 1*y22 + 2*y23 + 3*y24
s1 ~ i2
s2 ~ i1
"
library(lavaan)
mainRes = growth(mainModel, data=dparallel)
summary(mainRes)
fscores = lavPredict(mainRes)
broom::tidy(data.frame(fscores))
lm(s2~., fscores)
lazerhawk::corrheat(cor(fscores))
qplot(s1, i2, data=data.frame(fscores)) + geom_smooth(method='lm', se=F)
fv = lavPredict(mainRes, 'ov')
summary(mainRes, standardized=TRUE)
d3heatmap::d3heatmap(cor(fv, fscores))
d3heatmap::d3heatmap(cor(dplyr::select(dparallel, -Subject), ranef), Rowv = F, Colv = F)
process1Model = "
i1 =~ 1*y11 + 1*y12 + 1*y13 + 1*y14
s1 =~ 0*y11 + 1*y12 + 2*y13 + 3*y14
"
p1Res = growth(process1Model, data=dparallel)
fscoresP1 = lavPredict(p1Res)
process2Model = "
i2 =~ 1*y21 + 1*y22 + 1*y23 + 1*y24
s2 =~ 0*y21 + 1*y22 + 2*y23 + 3*y24
"
p2Res = growth(process2Model, data=dparallel)
fscoresP2 = lavPredict(p2Res)
fscoresSeparate = data.frame(fscoresP1, fscoresP2)
pathMod = "
s1 ~ i2
s2 ~ i1
i1~~i2
"
pathModRes = sem(pathMod, data=fscoresSeparate, fixed.x = F)
summary(pathModRes) # you'd have come to the same conclusions
summary(mainRes)
Summary
Growth curve modeling is an alternate way to do what is very commonly accomplished through mixed models, and it allows for more complex models than typically seen for standard mixed models. One’s default should probably be to use the more common, and probably more flexible (in most situations), mixed modeling tools. However, the latent variable approach may provide what you need, and at the very least gives you a fresh take on the standard mixed model perspective.
Mixture Models
Thus far we have understood latent variables as possessing an underlying continuum, i.e. normally distributed with a mean of zero and some variance. This does not have to be the case, and instead we can posit a categorical variable. Some approaches you may in fact be already familiar with, as any modeling process under the heading of ‘cluster analysis’ could be said to deal with latent categorical variables. The issue is that we may feel that there is some underlying structure to the data that is described as discrete, and based on perhaps multiple variables.
We will approach this in the way that has been done from statistical and related motivations, rather than the SEM/psychometric approach. This will make clearer what it is we’re dealing with as well as not get bogged down in terminology. Furthermore, mixture models are typically poorly implemented within SEM, as many of the typical issues can often be magnified. The goal here as before is clarity of thought over ‘being able to run it’.
A common question in such an analysis is: how many clusters? There are many, many techniques for answering this question, and not a single one of them is even remotely definitive. On the plus side, the good news is that we already know the answer, because the answer is always 1. However, that won’t stop us from trying to discover more than that, so here we go.
A Motivating Example
Take a look at the following data. It regards the waiting time between eruptions and the duration of the eruption (both in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
library(ggplot2)
data("faithful")

Take a good look. This is probably the cleanest separation of clustered data you will likely ever see, and even so there are still data points that might fall into either cluster.
Create Clustered Data
To get a sense of mixture models, let’s actually create some data that might look like the Old Faithful data above. In the following we create something that will look like the eruptions variable in the faithful data. To do so, we draw one random sample from a normal distribution with a mean of 2, and the other with a mean of 4.5, and both get a standard deviation of .25. The first plot is based on the code below, the second on the actual data.
library(ggplot2)
set.seed(1234)
erupt1 = rnorm(150, mean=2, sd=.25)
erupt2 = rnorm(150, mean=4.5, sd=.25)
erupt = sample(c(erupt1, erupt2))


What do we have and what do we see? The data is a mixture of two normals, but we can think of the observations as belonging to a latent class, and each class has its own mean and standard deviation (and is based on a normal, but doesn’t have to be). Each observation has some likelihood, however great or small, of coming from either cluster, and had we really wanted to do more appropriate simulation, we would incorporate that information.
A basic approach for categorical latent variable analysis from a model based perspective (a small worked sketch follows the lists below):
- Posit the number of clusters you believe there to be
- For each observation, estimate the probability of it coming from each cluster
- Assign observations to the most likely class (i.e. the one with the highest probability).
More advanced approaches might include:
- Predicting the latent classes in a manner akin to logistic regression
- Allow your model coefficients to vary based on cluster membership
- For example, have separate regression models for each class
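To make the basic recipe above concrete, here is a minimal, purely illustrative sketch using the simulated erupt values: posit two clusters with guessed means and standard deviations, compute each observation's probability of belonging to each, and assign it to the most likely one.
mu = c(2, 4.5); sigma = c(.25, .25); prior = c(.5, .5)   # guessed cluster parameters
like1 = prior[1] * dnorm(erupt, mu[1], sigma[1])
like2 = prior[2] * dnorm(erupt, mu[2], sigma[2])
probCluster1 = like1 / (like1 + like2)     # probability each observation came from cluster 1
cluster = ifelse(probCluster1 > .5, 1, 2)  # assign to the most likely cluster
table(cluster)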
Mixture modeling with Old Faithful
The following uses the flexmix package and function to estimate a regression model per class. In this case, our model includes only an intercept, and so is equivalent to estimating the mean and variance of each group. We posit k=2 groups.
We can see from the summary that about 2/3 of the observations are classified into one group. We also get the estimated means and standard deviations for each group. Note that the group labels are completely arbitrary.
library(flexmix)
mod = flexmix(eruptions~1, data=faithful, k = 2)
summary(mod)
Call:
flexmix(formula = eruptions ~ 1, data = faithful, k = 2)
prior size post>0 ratio
Comp.1 0.652 177 190 0.932
Comp.2 0.348 95 98 0.969
'log Lik.' -276.3611 (df=5)
AIC: 562.7222 BIC: 580.7512
head(mod@posterior$scaled, 10) %>% round(4) # show some estimated probabilities
[,1] [,2]
[1,] 1.0000 0.0000
[2,] 0.0000 1.0000
[3,] 1.0000 0.0000
[4,] 0.0001 0.9999
[5,] 1.0000 0.0000
[6,] 0.8384 0.1616
[7,] 1.0000 0.0000
[8,] 1.0000 0.0000
[9,] 0.0000 1.0000
[10,] 1.0000 0.0000
parameters(mod) #means and std dev for each group
Comp.1 Comp.2
coef.(Intercept) 4.2735128 2.0187913
sigma 0.4376169 0.2363539
The first plot shows the estimated probabilities for each observation for one of the clusters (with some jitter), which in this case are all around 0 or 1. Again, you will probably never see anything like this, but clarity is useful here. The second plot shows the original data with their classification and contours of the density for each group.


SEM and Latent Categorical Variables
Dealing with categorical latent variables can be somewhat problematic. Interpreting one SEM model might be difficult enough, but then one might be allowing parts of it to change depending on which latent class observations belong to, while having to assess the latent class measurement model as well. It can be difficult to find a clarity of understanding from this process, as one is discovering classes then immediately assuming key differences among these classes they didn’t know existed beforehand. In addition, one will need even more data than standard SEM to deal with all the additional parameters that are allowed to vary across the classes.
Researchers also tend to find classes that represent ‘hi-lo’ or ‘hi-med-lo’ groups, which may suggest they should have left the latent construct as a continuous variable. When given the choice to discretize continuous variables in normal settings, it is rare that the answer is anything but a resounding no. As such, one should think hard about the ultimate goals of such a model.
Terminology in SEM
Latent Class Analysis refers to dealing with categorical latent variables in the context of multivariate categorical data. For example one might have a series of yes/no questions on a survey, and want to discover categories of responses.
Latent Profile Analysis refers to dealing with categorical latent variables in the context of multivariate numerical data.
Latent Categories vs. Multigroup Analysis
The primary difference between the two is that one grouping structure actually exists in the data, for example, sex, race etc. In that case, a multigroup analysis would allow for separate SEM models per group. In the latent categorical variable situation, one must first discover the latent groups. In multigroup analysis, a common goal is to test measurement invariance, a concept which has several definitions itself. An example would be to see if the latent structure holds for an American vs. foreign sample, with the same items for the scale provided in the respective languages. This makes a lot of sense from a measurement model perspective and has some obvious use.
If one wants to see a similar situation for a less practically driven model, e.g. to see if an SEM model is the same for males vs. females, this is equivalent to having an interaction with the sex variable for every path in the model. The same holds for ‘subgroup analysis’ in typical settings, where you won’t find any more than you would by including the interactions of interest with the whole sample, though you will certainly have less data to work with. In any case, we need a lot of data to compare separate models where parameters are allowed to vary by group vs. a model in which they are fixed, and many simply do not have enough data for this.
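For reference, a minimal sketch of what a multigroup analysis looks like in lavaan; the model, data, and grouping variable (country) here are hypothetical.
fitConfigural = cfa(mymodel, data=mydata, group='country')
fitMetric     = cfa(mymodel, data=mydata, group='country', group.equal='loadings')
anova(fitConfigural, fitMetric)   # does constraining the loadings to be equal worsen fit?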
Latent Trajectories
As noted in the growth curve modeling section, these are growth curve models in which intercepts and slopes are allowed to vary across latent groups as well as the clusters. The flexmix package used previously would allow one to estimate such models from the mixed model perspective, and might be preferred.
Estimation
If you would like to see the conceptual innards of estimating mixture models via the EM algorithm, see this link on my GitHub page; a bare-bones version is also sketched below.
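A minimal (and not especially efficient or robust) EM implementation for a two-component univariate normal mixture might look like the following; it is meant only to show the E and M steps, not to replace a proper package.

em_mix <- function(y, k = 2, tol = 1e-6, maxit = 1000) {
  n      <- length(y)
  # crude starting values
  mu     <- quantile(y, probs = seq(.2, .8, length.out = k))
  sigma  <- rep(sd(y), k)
  pi_k   <- rep(1/k, k)
  ll_old <- -Inf

  for (i in 1:maxit) {
    # E-step: posterior class probabilities given current parameters
    dens  <- sapply(1:k, function(j) pi_k[j] * dnorm(y, mu[j], sigma[j]))
    gamma <- dens / rowSums(dens)

    # M-step: update mixing proportions, means, and standard deviations
    nk    <- colSums(gamma)
    pi_k  <- nk / n
    mu    <- sapply(1:k, function(j) sum(gamma[, j] * y) / nk[j])
    sigma <- sapply(1:k, function(j) sqrt(sum(gamma[, j] * (y - mu[j])^2) / nk[j]))

    # observed-data log-likelihood for the convergence check
    ll <- sum(log(rowSums(dens)))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }

  list(pi = pi_k, mu = mu, sigma = sigma, loglik = ll, cluster = max.col(gamma))
}

# e.g. em_mix(faithful$eruptions)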
Appendix
Data set descriptions
McClelland
Description
- McClelland et al. (2013) abstract
This study examined relations between children’s attention span-persistence in preschool and later school achievement and college completion. Children were drawn from the Colorado Adoption Project using adopted and non-adopted children (N = 430). Results of structural equation modeling indicated that children’s age 4 attention span-persistence significantly predicted math and reading achievement at age 21 after controlling for achievement levels at age 7, adopted status, child vocabulary skills, gender, and maternal education level. Relations between attention span-persistence and later achievement were not fully mediated by age 7 achievement levels. Logistic regressions also revealed that age 4 attention span-persistence skills significantly predicted the odds of completing college by age 25. The majority of this relationship was direct and was not significantly mediated by math or reading skills at age 7 or age 21. Specifically, children who were rated one standard deviation higher on attention span-persistence at age 4 had 48.7% greater odds of completing college by age 25. Discussion focuses on the importance of children’s early attention span-persistence for later school achievement and educational attainment.
Reference
McClelland, Acock, Piccinin, Rhea, & Stallings. (2013). Relations between preschool attention span-persistence and age 25 educational outcomes. link
National Longitudinal Survey of Youth (1997, NLSY97)
Description
NLSY97 consists of a nationally representative sample of approximately 9,000 youths who were 12 to 16 years old as of December 31, 1996. Round 1 of the survey took place in 1997. In that round, both the eligible youth and one of that youth’s parents received hour-long personal interviews. In addition, during the screening process, an extensive two-part questionnaire was administered that listed and gathered demographic information on members of the youth’s household and on his or her immediate family members living elsewhere. Youths are interviewed on an annual basis.
The NLSY97 is designed to document the transition from school to work and into adulthood. It collects extensive information about youths’ labor market behavior and educational experiences over time. Employment information focuses on two types of jobs, “employee” jobs where youths work for a particular employer, and “freelance” jobs such as lawn mowing and babysitting. These distinctions will enable researchers to study effects of very early employment among youths. Employment data include start and stop dates of jobs, occupation, industry, hours, earnings, job search, and benefits. Measures of work experience, tenure with an employer, and employer transitions can also be obtained. Educational data include youths’ schooling history, performance on standardized tests, course of study, the timing and types of degrees, and a detailed account of progression through post-secondary schooling.
Wheaton 1977 data
Description
Longitudinal data used to develop a model of the stability of alienation from 1967 to 1971, accounting for socioeconomic status (SES) as a covariate. Each of the three factors has two indicator variables: SES in 1966 is measured by education and occupational status in 1966, and alienation in both years is measured by powerlessness and anomia.
Reference
Wheaton, B., Muthén, B., Alwin, D., & Summers, G. (1977). Assessing reliability and stability in panel models. In D. R. Heise (Ed.), Sociological Methodology 1977 (pp. 84-136). San Francisco: Jossey-Bass.
Old Faithful
From the R helpfile
Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. A closer look at faithful$eruptions reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement. For a better version of the eruption times, see the examples in the R help file.
There are many versions of this dataset around: Azzalini and Bowman (1990) use a more complete version.
Harman 1974
Description
A correlation matrix of 24 psychological tests given to 145 seventh- and eighth-grade children in a Chicago suburb by Holzinger and Swineford.
Reference
Harman, H. H. (1976) Modern Factor Analysis, Third Edition Revised, University of Chicago Press, Table 7.4.
Terminology in SEM
SEM as it is known here has been widely used in psychology and education for decades, while other disciplines have developed and advanced related techniques that they would not typically call SEM. The following will be expanded over time.
Latent variable: an unobserved or hidden variable. Its specific interpretation will depend on the modeling context.
Factor Analysis: in the SEM literature, this refers to a latent variable (measurement) model to assess the underlying construct behind the correlations among a set of observed variables. Elsewhere it may refer to a very broad family of matrix factorization techniques that would include things like principal components analysis, non-negative matrix factorization, etc.
Exploratory vs. Confirmatory: This distinction is problematic. Science and data analysis are inherently exploratory, and most who use SEM do some model exploration as they would with any other model. Some SEM models have more constraints than others, but that does not require a separate name or way of thinking about the model.
Mediation and moderation: These mean different things, both straightforward, and which is utilized in a model should be based on theoretical notions (a brief sketch of a mediation model follows this list).
- Mediation: an indirect effect, e.g. A->B->C, A has an indirect effect on C. A can have a direct effect on C too.
- Moderation: an interaction (the same ones utilized in a standard regression modeling)
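As a quick illustration of how an indirect effect might be specified, again assuming the lavaan package and placeholder variable names:

library(lavaan)

med_model <- '
  B ~ a*A            # A -> B
  C ~ b*B + cp*A     # B -> C, plus the direct effect of A on C
  indirect := a*b    # indirect (mediated) effect of A on C
  total    := cp + a*b
'

med_fit <- sem(med_model, data = dat)
summary(med_fit)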
Fit: Model fit is notoriously difficult to ascertain in SEM, where all proposed cutoffs for a 'good' fit are ultimately arbitrary. Even if most fit indices suggest a 'good' fit, there is little indication the model has predictive capability.
Endo/Exogenous: Endogenous variables are determined by other variables, exogenous variables have no analyzed causes.
Disturbance: residual variance. Includes measurement error and unknown causes.
Mixture Models: models using categorical latent variables.
Resources
This list serves only as a starting point, though it may be added to over time.
Graphical Models
UseR Series: Contains texts on graphical models, Bayesian networks, and network analysis.
Measurement Models
Personality Project: William Revelle’s website and text on psychometric theory.
Other SEM